Today, Israeli AI startup aiOla announced the launch of a new open-source speech recognition model that it says is 50% faster than OpenAI's famous Whisper.
Officially dubbed Whisper-Medusa, the model builds on Whisper but uses a novel "multi-head attention" architecture that predicts several tokens per pass, where the OpenAI original predicts one. Its code and weights have been released on Hugging Face under an MIT license that allows research and commercial use.
“By releasing our solution as open source, we encourage further innovation and collaboration within the community, which can lead to even greater speed improvements and refinements as developers and researchers contribute to and build upon our work,” Gill Hetz, aiOla’s VP of research, tells VentureBeat.
The work could pave the way for compound AI systems that understand and answer whatever users ask in near real time.
What makes aiOla's Whisper-Medusa unique?
Even in the age of foundation models that can produce diverse content, advanced speech recognition remains highly relevant. The technology not only drives key functions across sectors like healthcare and fintech, helping with tasks like transcription, but also powers highly capable multimodal AI systems. Last year, category leader OpenAI took this route with its own Whisper model: it converted user audio into text, allowing an LLM to process the query and provide an answer, which was then converted back to speech.
Thanks to its ability to process complex speech across different languages and accents in near real time, Whisper has emerged as the gold standard in speech recognition, drawing more than 5 million downloads every month and powering tens of thousands of apps.
But what if a model could recognize and transcribe speech even faster than Whisper? That is what aiOla claims to have achieved with the new Whisper-Medusa offering, paving the way for more seamless speech-to-text conversion.
To develop Whisper-Medusa, the company modified Whisper's architecture to add a multi-head attention mechanism, which lets a model jointly attend to information from different representation subspaces at different positions by using multiple "attention heads" in parallel. The architectural change enables the model to predict ten tokens at each pass rather than the standard one token at a time, ultimately making speech prediction and generation roughly 50% faster.
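To make the idea concrete, here is a minimal, illustrative PyTorch sketch of what such parallel prediction heads could look like. It is not aiOla's published implementation; the layer sizes, head count and the way the heads read the decoder's hidden states are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class MedusaStyleHeads(nn.Module):
    """Illustrative only: K extra linear heads on top of the decoder's hidden
    states, where head k proposes the token k steps ahead, so one decoder pass
    yields K candidate tokens instead of one."""

    def __init__(self, d_model: int = 768, vocab_size: int = 51865, num_heads: int = 10):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(num_heads)
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model) from the speech decoder
        # returns: (num_heads, batch, seq_len, vocab_size)
        return torch.stack([head(hidden_states) for head in self.heads], dim=0)

# One decoder pass -> 10 candidate tokens following the last position
heads = MedusaStyleHeads()
hidden = torch.randn(1, 32, 768)                    # stand-in for decoder output
proposals = heads(hidden)[:, :, -1, :].argmax(-1)   # shape (10, 1): tokens t+1 ... t+10
```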
More importantly, since Whisper-Medusa's backbone is built on top of Whisper, the increased speed does not come at the cost of performance: the new offering transcribes speech with the same level of accuracy as the original Whisper. Hetz noted the company is the first in the industry to successfully apply the approach to an ASR model and open it to the public for further research and development.
“Improving the speed and latency of LLMs is much easier to do than with automatic speech recognition systems. The encoder and decoder architectures present unique challenges due to the complexity of processing continuous audio signals and handling noise or accents. We addressed these challenges by employing our novel multi-head attention approach, which resulted in a model with nearly double the prediction speed while maintaining Whisper’s high levels of accuracy,” he said.
How was the speech recognition model trained?
When training Whisper-Medusa, aiOla employed a machine-learning approach called weak supervision. As part of this, it froze the main components of Whisper and used audio transcriptions generated by the model as labels to train additional token prediction modules.
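As a rough sketch of what that recipe could look like using Hugging Face's stock Whisper classes: freeze the backbone, let it transcribe the audio itself, and train only the extra heads on those pseudo-labels. The checkpoint name, the head module (reused from the sketch above) and the loss wiring are illustrative assumptions, not aiOla's released training code.

```python
import torch
import torch.nn.functional as F
from transformers import WhisperForConditionalGeneration

# Hypothetical weak-supervision recipe: a frozen, stock Whisper checkpoint
# produces its own transcriptions, which serve as labels for the extra heads.
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
whisper.requires_grad_(False)                                   # freeze the backbone

heads = MedusaStyleHeads(d_model=whisper.config.d_model,
                         vocab_size=whisper.config.vocab_size)  # sketch from above
optimizer = torch.optim.AdamW(heads.parameters(), lr=1e-4)

def training_step(input_features: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        labels = whisper.generate(input_features)               # pseudo-labels, (1, T)
        out = whisper(input_features=input_features,
                      decoder_input_ids=labels,
                      output_hidden_states=True)
    hidden = out.decoder_hidden_states[-1]                      # (1, T, d_model)

    # Head k at position t learns to predict the pseudo-label token at t + k + 1.
    logits = heads(hidden)                                      # (K, 1, T, vocab)
    loss = 0.0
    for k, head_logits in enumerate(logits):
        shift = k + 1
        if labels.size(1) <= shift:
            break
        loss = loss + F.cross_entropy(
            head_logits[:, :-shift].reshape(-1, head_logits.size(-1)),
            labels[:, shift:].reshape(-1),
        )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

Because the backbone stays frozen and only the small head modules are updated, such a setup is consistent with aiOla's point that Whisper's original accuracy is preserved.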
Hetz told VentureBeat the company started with a 10-head model but will soon expand to a larger 20-head version capable of predicting 20 tokens at a time, leading to faster recognition and transcription without any loss of accuracy.
“We chose to train our model to predict 10 tokens on each pass, achieving a substantial speedup while retaining accuracy, but the same approach can be used to predict any arbitrary number of tokens in each step. Since the Whisper model’s decoder processes the entire speech audio at once, rather than segment by segment, our method reduces the need for multiple passes through the data and efficiently speeds things up,” the research VP explained.
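As a back-of-the-envelope illustration of where the speedup comes from, the loop below advances decoding by up to K accepted tokens per pass instead of one, so the number of decoder passes drops by roughly a factor of K before other overheads. The acceptance and stopping rules here are deliberately simplified assumptions, not aiOla's actual decoding logic.

```python
from typing import Callable, List

def multi_token_greedy_decode(step_fn: Callable[[List[int]], List[int]],
                              bos_ids: List[int],
                              eot_id: int,
                              max_len: int = 448) -> List[int]:
    """step_fn runs one decoder pass over the current tokens and returns up to
    K proposed next-token ids; accepting K tokens per pass cuts the number of
    passes by roughly K versus one-token-at-a-time decoding."""
    tokens = list(bos_ids)
    while len(tokens) < max_len:
        proposals = step_fn(tokens)          # one decoder pass, K proposals
        if not proposals:
            break
        for tok in proposals:
            tokens.append(tok)
            if tok == eot_id or len(tokens) >= max_len:
                return tokens
    return tokens
```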
Hetz did not say much when asked whether any company has early access to Whisper-Medusa. However, he did point out that the model has been tested on real enterprise data use cases to ensure it performs accurately in real-world scenarios. Eventually, he believes, improvements in recognition and transcription speed will allow faster turnaround times in speech applications and pave the way for real-time responses. Imagine Alexa recognizing your command and returning the expected answer in a matter of seconds.
“The industry stands to benefit greatly from any solution involving real-time speech-to-text capabilities, like those in conversational speech applications. Individuals and companies can enhance their productivity, reduce operational costs, and deliver content more promptly,” Hetz added.