Today, Dubai-based Camb AI, a startup researching AI-driven content localization technologies, announced the release of Mars5, a powerful AI model for voice cloning.
While there are plenty of models that can create digital voice replicas, including those from ElevenLabs, Camb claims to differentiate by offering a much higher level of realism with Mars5’s outputs.
According to early samples shared by the company, the model not only emulates the original voice but also its complex prosodic parameters, including rhythm, emotion and intonation.
Camb also supports nearly three times as many languages as ElevenLabs: more than 140, including low-resource ones like Icelandic and Swahili, compared to ElevenLabs’ 36. However, the open-source release, which can be accessed on GitHub starting today, covers only English. The version with expanded language support is available on the company’s paid Studio.
“The level of prosody and realism that Mars5 is able to capture, even with just a few seconds of input, is unprecedented. This is a Mistral moment in speech,” Akshat Prakash, the co-founder and CTO of the company, said in a statement.
Emulating voices with prosody
Normally, voice cloning and text-to-speech conversion are two separate offerings: the former captures the parameters of a given voice sample to create a digital replica, while the latter uses that replica to convert any given text into synthetic speech. As we have seen in the past, the technology has the potential to make anyone appear to say anything.
With Mars5, Camb AI takes the work a step further by merging both capabilities into a unified pipeline. All a user has to do is upload an audio file, ranging from a few seconds to a minute in length, and provide the text content. The model then uses the speaker’s voice in the audio file as a reference, captures the relevant details – including the original voice, speaking style, emotion, enunciation and meaning – and synthesizes the provided text as speech in that voice.
The company claims Mars5 can capture diverse emotional tones and pitches, covering all sorts of complex speech scenarios such as when a person is frustrated, commanding, calm or even spirited. This, Prakash noted, makes it suitable for content that has been traditionally difficult to convert into speech such as sports commentary, movies, and anime.
To achieve this level of prosody, Mars5 combines a Mistral-style ~750M parameter autoregressive model with a novel ~450M parameter non-autoregressive multinomial diffusion model, operating on 6kbps encodec tokens.
“The AR model iteratively predicts the most coarse (lowest level) codebook value for the encodec features, while the NAR model takes the AR output and infers the remaining codebook values in a discrete denoising diffusion task. Specifically, the NAR model is trained as a DDPM using a multinomial distribution on encodec features, effectively ‘inpainting’ the remaining codebook entries after the AR model has predicted the coarse codebook values,” Prakash explained.
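The two-stage scheme Prakash describes can be sketched in miniature: an autoregressive pass fixes the coarsest Encodec codebook frame by frame, then a non-autoregressive loop repeatedly re-samples the finer codebooks from a multinomial distribution, conditioned on the fixed coarse row – a discrete inpainting step. The sketch below is illustrative only: the stub logits stand in for the trained ~750M AR and ~450M NAR networks, and the codebook count (8 codebooks at 6 kbps for a 24 kHz Encodec) and iteration count are assumptions, not disclosed details.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 1024     # codec codebook size (typical for Encodec; assumed)
N_BOOKS = 8      # codebooks per frame at 6 kbps (assumed)
T = 50           # frames to generate

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Stage 1 (AR, "Mistral-style"): predict the coarsest codebook value
# left to right. A real model conditions on text and the reference
# audio; random logits stand in for the transformer here.
def ar_step(prefix):
    return rng.choice(VOCAB, p=softmax(rng.normal(size=VOCAB)))

coarse = np.empty(T, dtype=np.int64)
for t in range(T):
    coarse[t] = ar_step(coarse[:t])

# Stage 2 (NAR multinomial diffusion): start the remaining codebooks
# from random tokens and iteratively re-sample ("denoise") them while
# the coarse row stays fixed -- discrete inpainting of the fine detail.
codes = rng.integers(0, VOCAB, size=(N_BOOKS, T))
codes[0] = coarse                       # coarse row is never touched
for step in range(10):                  # denoising iterations (assumed)
    probs = softmax(rng.normal(size=(N_BOOKS - 1, T, VOCAB)))
    for b in range(N_BOOKS - 1):
        for t in range(T):
            codes[b + 1, t] = rng.choice(VOCAB, p=probs[b, t])
```

In the real system, `codes` would then be fed to the Encodec decoder to produce a waveform; the split lets the slow autoregressive model handle only one codebook while the parallel diffusion model fills in the rest.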
Better than other text-to-speech and voice cloning models?
While specific benchmark figures are yet to be shared, early samples and tests (with a few seconds of reference audio) run by VentureBeat suggest the model mostly performs better than popular open and closed-source speech synthesis models, including those from MetaVoice and ElevenLabs. The competing offerings synthesized speech clearly, but their results did not sound as close to the original voice as Mars5’s did.
“ElevenLabs is closed source so it’s hard to specifically say why they aren’t able to capture nuances that we can, but given that they report training on 500K+ hours (almost 5 times the dataset we have in English), it is clear to us that we have a superior model design that learns speech and its nuances better than theirs. Of course, as our datasets continue to grow and Mars5 trains even more, which we will release in successive checkpoints in Github, we expect it to only get better and better and better, especially considering support from the open-source community,” the CTO added.
As the company continues to bolster the voice cloning and text-to-speech performance of Mars5, it is also planning the open-source release of another model called Boli. This one has been designed to enable translation with contextual understanding, correct grammar and apt colloquialisms.
“Boli is our proprietary translation model, which surpasses traditional engines such as Google Translate and DeepL in capturing the nuances and colloquial aspects of language. Unlike large-scale parallel corpus-based systems, Boli offers a more consistent and natural translation experience, particularly in low- to medium-resource languages. Feedback from clients indicates that Boli’s translations outperform those produced by mainstream tools, including the latest generative models like ChatGPT,” Prakash said.
Currently, both Mars5 and Boli work with 140 languages on Camb’s proprietary platform, Camb Studio. The company also provides these capabilities as APIs to enterprises, SMEs and developers. Prakash did not share the exact number of customers, but he did point out that the company is working with Major League Soccer, Tennis Australia and Maple Leaf Sports & Entertainment, as well as leading movie and music studios and several government agencies.
For Major League Soccer, Camb AI live-dubbed a game into four languages in parallel for over 2 hours, uninterrupted – becoming the first company to do so. It also translated the Australian Open’s post-match conference into multiple languages and translated the psychological thriller “Three” from Arabic to Mandarin.