Deciphering the Underserved: Benchmarking LLM OCR for Low-Resource Scripts

stp2y, December 23, 2024

[Submitted on 20 Dec 2024]

Authors: Muhammad Abdullah Sohail and 2 other authors

Abstract: This study investigates the potential of Large Language Models (LLMs), particularly GPT-4o, for Optical Character Recognition (OCR) in low-resource scripts such as Urdu, Albanian, and Tajik, with English serving as a benchmark. Using a meticulously curated dataset of 2,520 images incorporating controlled variations in text length, font size, background color, and blur, the research simulates diverse real-world challenges. Results emphasize the limitations of zero-shot LLM-based OCR, particularly for linguistically complex scripts, highlighting the need for annotated datasets and fine-tuned models. This work underscores the urgency of addressing accessibility gaps in text digitization, paving the way for inclusive and robust OCR solutions for underserved languages.
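The abstract does not say which metric the authors used to score OCR output. As an illustration only, the sketch below computes character error rate (CER), a common OCR measure: the edit distance between the reference text and the model output, divided by the reference length. The function names and the choice of NFC normalization are assumptions made for this example, not details taken from the paper.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length (assumed metric, see lead-in)."""
    # Normalize so that equivalent composed/decomposed forms compare equal.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Counting in code points is a simplification. For scripts such as Urdu, where diacritics are separate code points, a grapheme-level count may be a better fit.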

Submission history

From: Muhammad Abdullah Sohail
[v1]
Fri, 20 Dec 2024 18:05:22 UTC (1,281 KB)

