MessIRve: A Large-Scale Spanish Information Retrieval Dataset


[Submitted on 9 Sep 2024]

By Francisco Valentini and 5 other authors


Abstract: Information retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, current IR benchmarks lack Spanish data, hindering the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with around 730 thousand queries from Google’s autocomplete API and relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.

Submission history

From: Francisco Valentini
[v1] Mon, 9 Sep 2024 18:45:04 UTC (6,973 KB)


