Speech recognition is a critical part of multimodal AI systems. Most enterprises are racing to implement the technology, but even after all the advancements to date, many speech recognition models out there can fail to understand what a person is saying. Today, aiOla, an Israeli startup specializing in this field, took a major step towards solving this problem by announcing an approach that teaches these models to understand industry-specific jargon and vocabulary.
The development enhances the accuracy and responsiveness of speech recognition systems, making them more suitable for complex enterprise settings – even in challenging acoustic environments. As an initial case study, the startup adapted OpenAI’s famous Whisper model with its technique, reducing its word error rate and improving overall detection accuracy.
However, the company says the technique can work with any speech recognition model, including Meta’s MMS model and proprietary models, which could improve even the highest-performing speech-to-text systems.
The problem of jargon in speech recognition
Over the last few years, deep learning on hundreds of thousands of hours of audio has enabled the rise of high-performing automatic speech recognition (ASR) and transcription systems. OpenAI’s Whisper, one such breakthrough model, made particular headlines in the field with its ability to match human-level robustness and accuracy in English speech recognition.
However, since its launch in 2022, many have noted that despite being as good as a human listener, Whisper’s recognition performance could decline when applied to audio from complex, real-world environmental conditions. Imagine safety alerts from workers with continuous noise of heavy machinery in the background, activation prompts from people in public spaces or commands with specific utterances and terminology such as those commonly used in medical or legal domains.
Most organizations using state-of-the-art ASR models (Whisper and others) have tried solving this problem with training tailored to their industry’s unique requirements. The approach does the job but can easily end up taking a toll on the company’s financial and human resources.
“Fine-tuning ASR models takes days and thousands of dollars — and that’s only if you already have the data. If you don’t, then it’s a whole other ballgame. Collecting and labeling audio data could take months and cost many tens of thousands of dollars. For example, if you want to fine-tune your ASR model to recognize a vocabulary of 100 industry-specific terms and jargon, you’d need thousands of audio examples in various settings that would all need to be manually transcribed. If afterward, you wanted to add to your model just one new keyword, then you’d have to retrain on new examples,” Gil Hetz, VP of research at aiOla, told VentureBeat.
To solve this, the startup came up with a two-step “contextual biasing” approach. First, the company’s AdaKWS keyword spotting model identifies domain-specific and personalized jargon (pre-defined in a list of jargon) from a given speech sample. Then, these identified keywords are utilized to prompt the ASR decoder, guiding it to incorporate them into the final transcribed text. This augments the model’s overall speech recognition capability, adapting it to correctly detect the jargon or terms in question.
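The two steps described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not aiOla's implementation: the function names, the comma-separated prompt format, and the `spot_keywords` and `decode` callables are all hypothetical stand-ins for an audio keyword spotter such as AdaKWS and a promptable ASR decoder such as Whisper.

```python
from typing import Callable, List, Sequence


def build_keyword_prompt(detected: Sequence[str]) -> str:
    """Turn detected keywords into a decoder prompt (format is an assumption)."""
    seen = set()
    unique: List[str] = []
    for keyword in detected:
        key = keyword.lower()
        if key not in seen:  # drop case-insensitive duplicates, keep first-seen order
            seen.add(key)
            unique.append(keyword)
    return ", ".join(unique)


def transcribe_with_biasing(
    audio,
    jargon: Sequence[str],
    spot_keywords: Callable[[object, Sequence[str]], List[str]],
    decode: Callable[[object, str], str],
) -> str:
    """Step 1: spot jargon in the audio. Step 2: prompt the decoder with it."""
    detected = spot_keywords(audio, jargon)  # hypothetical keyword spotter
    prompt = build_keyword_prompt(detected)
    return decode(audio, prompt)  # hypothetical promptable ASR decoder
```

Because the jargon list is an ordinary argument, swapping vocabularies means passing a different list rather than retraining either model, which is the property the article attributes to the approach.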
In the initial tests for keyword-based contextual biasing, aiOla used Whisper – the best model in the category – and tried two techniques to improve its performance. The first, termed KG-Whisper or keyword-guided Whisper, fine-tuned the entire set of decoder parameters. The second, termed KG-Whisper-PT or prompt tuning, used only some 15K trainable parameters, making it far more efficient. In both cases, the adapted models performed better than the original Whisper baselines on various datasets, even in challenging acoustic environments.
“Our new model (KG-Whisper-PT) significantly improves on the Word Error Rate (WER) and overall accuracy (F1 score) compared to Whisper. When tested on a medical dataset highlighted in our research, it achieved a higher F1 score of 96.58 versus Whisper’s 80.50, and a lower word error rate of 6.15 compared to Whisper’s 7.33,” Hetz said.
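For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below shows that standard definition and the relative reduction implied by the quoted figures; the function names and the example sentence are illustrative, and the article does not describe how aiOla's evaluation was actually run.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# One wrong word out of four gives a WER of 0.25
example = word_error_rate("check the brake caliper", "check the break caliper")

# Figures quoted above: 7.33 for Whisper, 6.15 for KG-Whisper-PT
wer_drop = relative_reduction(7.33, 6.15)  # roughly 0.16, i.e. about 16%
```

On the quoted numbers, the absolute WER drop is 1.18 points, or about a 16% relative reduction, while the F1 score rises by roughly 16 points.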
Most importantly, the approach works with different models. aiOla used it with Whisper, but enterprises can use it with any other ASR model they have – from Meta’s MMS to proprietary speech-to-text models – to enable a bespoke recognition system with zero retraining overhead. All they have to do is provide the list of their industry-specific words to the keyword spotter and keep it up to date.
“The combination of these models gives full ASR capabilities that can accurately identify jargon. It allows us to instantly adapt to different industries by swapping out jargon vocabularies without retraining the entire system. This is essentially a zero-shot model, capable of making predictions without having seen any specific examples during training,” Hetz explained.
Saving time for Fortune 500 enterprises
With its adaptability, the approach can come in handy across a range of industries involving technical jargon, from aviation, transportation and manufacturing to supply chain and logistics. aiOla, for its part, has already started deploying its adaptive model with Fortune 500 enterprises, increasing their efficiency at handling jargon-heavy processes.
“One of our customers, a Fortune 50 global shipping and logistics leader, needed to conduct daily truck inspections before deliveries. Previously, each inspection took around 15 minutes per vehicle. With an automated workflow powered by our new model, this time went down to under 60 seconds per vehicle. Similarly, one of Canada’s leading grocers used our models to inspect product and meat temperatures as required by health departments. This led to time savings that are projected to reach 110,000 hours saved annually, more than $2.5 million in expected savings, and a 5X ROI,” Hetz noted.
aiOla has published the research for its novel approach with the hope that other AI research teams will build on its work. However, as of now, the company is not providing API access to the adapted model or releasing the weights. The only way enterprises can use it is through the company’s product suite, which operates on a subscription-based pricing structure.