Diffusion-based Unsupervised Audio-visual Speech Enhancement



By Jean-Eudes Ayilo (MULTISPEECH) and 3 other authors

Abstract: This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach that combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. First, the diffusion model is pre-trained on clean speech, conditioned on the corresponding video data, to learn the generative distribution of speech. This pre-trained model is then paired with the NMF-based noise model to estimate clean speech iteratively. Specifically, a diffusion-based posterior sampling approach is implemented within the reverse diffusion process: after each iteration, a speech estimate is obtained and used to update the noise parameters. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised generative AVSE method. Additionally, the new inference algorithm offers a better balance between inference speed and performance than the previous diffusion-based method. Code and demo available at: this https URL
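The iterative scheme described in the abstract alternates between reverse-diffusion steps under a pre-trained speech prior and NMF updates of the noise parameters. The toy sketch below illustrates that alternation only; the `score_model`, the guidance step size, the noise schedule, and all dimensions are placeholder assumptions, not the paper's actual network or sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_model(x, sigma, video_feat):
    # Placeholder for the pre-trained audio-visual score network;
    # here a simple Gaussian-prior score that pulls x toward zero.
    return -x / (sigma**2 + 1.0)

def nmf_update(W, H, V, eps=1e-8):
    # One KL-divergence multiplicative update so that V ~= W @ H,
    # preserving non-negativity of W and H.
    WH = W @ H + eps
    H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
    W *= ((V / WH) @ H.T) / (np.ones_like(V) @ H.T + eps)
    return W, H

# Toy spectrogram dimensions (frequency bins, frames, NMF rank) - assumed.
F, T, K = 32, 20, 4
noisy = np.abs(rng.normal(size=(F, T)))   # |noisy STFT| magnitudes (toy data)
W = np.abs(rng.normal(size=(F, K)))       # NMF noise dictionary
H = np.abs(rng.normal(size=(K, T)))       # NMF activations
video_feat = None                         # placeholder for video conditioning

x = rng.normal(size=(F, T))               # start reverse diffusion from noise
sigmas = np.linspace(1.0, 0.05, 30)       # assumed noise schedule

for sigma in sigmas:
    # 1) Reverse-diffusion step under the (placeholder) speech prior.
    x = x + sigma**2 * score_model(x, sigma, video_feat)
    # 2) Likelihood guidance: nudge the estimate toward the noisy
    #    observation, weighted by the NMF noise variance W @ H.
    noise_var = np.maximum(W @ H, 0.1)    # floor avoids huge steps
    x = x + 0.05 * (noisy - np.abs(x)) / noise_var * np.sign(x)
    # 3) Update the NMF noise parameters from the current residual.
    residual = np.maximum(noisy - np.abs(x), 1e-8)
    W, H = nmf_update(W, H, residual)

speech_estimate = np.abs(x)
print(speech_estimate.shape)
```

In the actual method the posterior-sampling step would use the learned score network and the paper's specific guidance rule; the structure to note is simply that each reverse-diffusion iteration yields a speech estimate that in turn refreshes the NMF noise parameters.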

Submission history

From: Mostafa SADEGHI [view email] [via CCSD proxy]
[v1] Fri, 4 Oct 2024 12:22:54 UTC (303 KB)
[v2] Wed, 15 Jan 2025 09:42:42 UTC (308 KB)


