CountCLIP — [Re] Teaching CLIP to Count to Ten


[Submitted on 5 Jun 2024]


Abstract: Large vision-language models (VLMs) learn rich joint image-text representations that enable strong performance on downstream tasks. However, they fail to demonstrate a quantitative understanding of objects and lack counting-aware representations. This paper presents a reproducibility study of ‘Teaching CLIP to Count to Ten’ (Paiss et al., 2023), which introduces a counting-contrastive loss term to finetune a CLIP model (Radford et al., 2021) for improved zero-shot counting accuracy while maintaining zero-shot classification performance. We verify the original claims by reproducing the study with our own code, and we improve the model's performance using a smaller subset of the original training data and lower computational resources. The implementation can be found at this https URL.
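
To illustrate the idea of a counting-contrastive loss term added alongside the standard CLIP objective, here is a minimal sketch, not the authors' code. It assumes a `clip_model` with the usual `encode_image`/`encode_text` interface; the caption batches, the temperature value, and the weighting `lam` are all hypothetical.

```python
# Hedged sketch of a counting-contrastive loss: for each image, contrast the
# caption stating the correct object count against an otherwise-identical
# caption stating a wrong (counterfactual) count.
import torch
import torch.nn.functional as F

def counting_contrastive_loss(clip_model, images, count_captions, counterfactual_captions):
    """images: (B, 3, H, W); the two caption tensors are pre-tokenized, shape (B, L)."""
    img = F.normalize(clip_model.encode_image(images), dim=-1)                        # (B, D)
    txt_pos = F.normalize(clip_model.encode_text(count_captions), dim=-1)             # correct count
    txt_neg = F.normalize(clip_model.encode_text(counterfactual_captions), dim=-1)    # wrong count

    sim_pos = (img * txt_pos).sum(dim=-1)   # cosine similarity to correct-count caption
    sim_neg = (img * txt_neg).sum(dim=-1)   # cosine similarity to wrong-count caption

    # Two-way softmax: the image embedding should prefer the correct-count caption.
    logits = torch.stack([sim_pos, sim_neg], dim=-1) / 0.07   # temperature is an assumption
    labels = torch.zeros(images.shape[0], dtype=torch.long, device=images.device)
    return F.cross_entropy(logits, labels)

# Total fine-tuning objective (weighting assumed):
#   loss = standard_clip_loss + lam * counting_contrastive_loss(...)
```

The point of the extra term is that the image must separate the correct count from a near-duplicate caption differing only in the stated number, which is what pushes the representation toward counting awareness while the standard CLIP loss preserves zero-shot classification.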

Submission history

From: Harshvardhan Mestha
[v1] Wed, 5 Jun 2024 19:05:08 UTC (7,529 KB)
