NVIDIA has debuted a new experimental generative AI model, which it describes as “a Swiss Army knife for sound.” The model, called Foundational Generative Audio Transformer Opus 1, or Fugatto, can take text prompts and use them to create audio or to modify existing music, voice and sound files. It was designed by a team of AI researchers from around the world, and NVIDIA says that made the model’s “multi-accent and multilingual capabilities stronger.”
“We wanted to create a model that understands and generates sound like humans do,” said Rafael Valle, one of the researchers behind the project and a manager of applied audio research at NVIDIA. In its announcement, the company listed some possible real-world scenarios wherein Fugatto could be of use. Music producers, it suggested, could use the technology to quickly generate a prototype for a song idea, which they could then easily edit to try out different styles, voices and instruments.
People could use it to generate materials for language learning tools in the voice of their choice. And video game developers could use it to create variations of pre-recorded assets to fit changes in the game based on players’ choices and actions. In addition, the researchers found that the model can accomplish tasks it wasn’t pre-trained on, with some fine-tuning. It could combine instructions that it was trained on separately, such as generating speech that sounds angry with a specific accent, or the sound of birds singing during a thunderstorm. The model can also generate sounds that change over time, like the pounding of a rainstorm as it moves across the land.
NVIDIA didn’t say whether it will give the public access to Fugatto, but the model isn’t the first generative AI technology that can create sounds from text prompts. Meta previously released an open-source AI kit that can create sounds from text descriptions, and Google has its own text-to-music AI called MusicLM, which people can access through the company’s AI Test Kitchen website.