Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition

AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning


View a PDF of the paper titled Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition, by Zixuan Wang and 2 other authors

View PDF
HTML (experimental)

Abstract:We introduce Audio-Agent, a multimodal framework for audio generation, editing and composition based on text or video inputs. Conventional approaches for text-to-audio (TTA) tasks often make single-pass inferences from text descriptions. While straightforward, this design struggles to produce high-quality audio when given complex text conditions. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4, which decomposes the text condition into atomic, specific instructions and calls the agent for audio generation. In doing so, Audio-Agent can generate high-quality audio that is closely aligned with the provided text or video exhibiting complex and multiple events, while supporting variable-length and variable-volume generation. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio, a process that can be tedious and time-consuming. Instead, we propose a simpler approach by fine-tuning a pre-trained Large Language Model (LLM), e.g., Gemma2-2B-it, to obtain both semantic and temporal conditions that bridge the video and audio modality. Consequently, our framework contributes a comprehensive solution for both TTA and VTA tasks without substantial computational overhead in training.

Submission history

From: Zixuan Wang [view email]
[v1]
Fri, 4 Oct 2024 11:40:53 UTC (1,531 KB)
[v2]
Tue, 14 Jan 2025 11:59:03 UTC (1,842 KB)



Source link
lol

By stp2y

Leave a Reply

Your email address will not be published. Required fields are marked *

No widgets found. Go to Widget page and add the widget in Offcanvas Sidebar Widget Area.