The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language

stp2y, January 9, 2025

[Submitted on 12 Sep 2024 (v1), last revised 7 Jan 2025 (this version, v3)]

Authors: Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar

Abstract: We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark corpus designed to push the limits of current approaches to low-resource speech recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy, has no standard orthography, has virtually no existing textual or speech resources other than what is included in the benchmark, and is quite different from other forms of Franco-Provençal. The corpus comes from field recordings, most of which are noisy, for which only 5 hrs have matching transcriptions, and for which forced alignment is of variable quality. The corpus contains an additional 20 hrs of unlabelled speech. We report baseline results from state-of-the-art multilingual speech foundation models with a best phone error rate of 30.4%, using a pipeline that continues pre-training on the foundation model using the unlabelled set.

Submission history

From: Ewan Dunbar
[v1] Thu, 12 Sep 2024 14:55:33 UTC (5,048 KB)
[v2] Sun, 6 Oct 2024 01:32:03 UTC (5,048 KB)
[v3] Tue, 7 Jan 2025 15:32:33 UTC (5,030 KB)
