OpenAI today updated its Realtime API, which is currently in beta, adding new voices for speech-to-speech applications and cutting costs through prompt caching.
Beta users of the Realtime API will now have five new voices they can use to build their applications. OpenAI showcased three of the new voices, Ash, Verse and the British-sounding Ballad, in a post on X.
Two Realtime API updates:
– You can now build speech-to-speech experiences with five new voices—which are much more expressive and steerable.
– We’re lowering the price by using prompt caching. Cached text inputs are discounted 50% and cached audio inputs are discounted…
— OpenAI Developers (@OpenAIDevs) October 30, 2024
The company said in its API documentation that the native speech-to-speech feature “skip[s] an intermediate text format,” which “means low latency and nuanced output,” while the new voices are easier to steer and more expressive than its previous ones.
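For developers trying the beta, voice selection happens at the session level. The sketch below is a minimal, hypothetical example that assumes the beta’s WebSocket interface and the `websockets` Python package; event and field names follow OpenAI’s beta documentation at the time of writing and may change, so check the current API reference before relying on it.

```python
# Minimal sketch: configuring a Realtime API session to use one of the new
# voices (e.g. "ash") over the beta WebSocket interface. Event names are taken
# from OpenAI's beta docs and may change; treat this as illustrative only.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main() -> None:
    # Note: older websockets releases use `extra_headers`, newer ones use
    # `additional_headers`; adjust to the version you have installed.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Pick one of the new voices and enable audio output for the session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "ash", "modalities": ["audio", "text"]},
        }))
        # Ask the model to respond; the server streams audio back as events.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"instructions": "Say hello in one short sentence."},
        }))
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.done":
                break

asyncio.run(main())
```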
However, OpenAI warns that it cannot yet offer client-side authentication for the API while it remains in beta. It also said there may be issues with processing real-time audio.
“Network conditions heavily affect real-time audio, and delivering audio reliably from a client to a server at scale is challenging when network conditions are unpredictable,” the company shared.
OpenAI’s history with AI-powered speech and voices has been controversial. In March, it released Voice Engine, a voice cloning platform to rival ElevenLabs, but it limited access to only a few researchers. In May, after the company demoed its GPT-4o and Voice Mode, it paused using one of the voices, Sky, after the actress Scarlett Johansson spoke out about its similarity to her voice.
The company rolled out ChatGPT Advanced Voice Mode for paying subscribers (those using ChatGPT Plus, Enterprise, Team and Edu) in the U.S. in September.
Speech-to-speech AI would ideally let enterprises build more responsive, real-time voice experiences. If a customer calls a company’s customer service line, a speech-to-speech model can take the person’s voice, understand what they are asking, and respond using an AI-generated voice with lower latency. Speech-to-speech also lets users generate voice-overs: a user speaks their lines, but the output voice is not theirs. Platforms that offer this include Replica and, of course, ElevenLabs.
OpenAI released the Realtime API this month during its Dev Day. The API aims to speed up the building of voice assistants.
Lowering costs
Using speech-to-speech features, though, could get expensive.
When the Realtime API launched, pricing was set at $0.06 per minute of audio input and $0.24 per minute of audio output, which is not cheap. However, the company plans to lower Realtime API prices with prompt caching.
Cached text inputs will drop by 50%, and cached audio inputs will be discounted by 80%.
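Taken at face value, those discounts can be illustrated with a rough back-of-the-envelope calculation. The snippet below simply applies the per-minute figures and the 80% audio-caching discount quoted above; actual billing is token-based, so treat it as an approximation rather than OpenAI’s pricing formula.

```python
# Rough cost illustration using the per-minute rates and caching discounts
# quoted in this article. Real Realtime API billing is token-based, so this
# is only an approximation.
AUDIO_INPUT_PER_MIN = 0.06    # USD per minute of audio input
AUDIO_OUTPUT_PER_MIN = 0.24   # USD per minute of audio output
CACHED_AUDIO_DISCOUNT = 0.80  # cached audio inputs discounted 80%

def estimated_cost(input_min: float, output_min: float, cached_fraction: float) -> float:
    """Estimate cost when `cached_fraction` of the audio input is served from cache."""
    fresh_input = input_min * (1 - cached_fraction) * AUDIO_INPUT_PER_MIN
    cached_input = (input_min * cached_fraction * AUDIO_INPUT_PER_MIN
                    * (1 - CACHED_AUDIO_DISCOUNT))
    return fresh_input + cached_input + output_min * AUDIO_OUTPUT_PER_MIN

# A ten-minute call with no caching vs. half of the input hitting the cache:
print(f"No caching: ${estimated_cost(10, 10, 0.0):.2f}")  # $3.00
print(f"50% cached: ${estimated_cost(10, 10, 0.5):.2f}")  # $2.76
```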
OpenAI also announced Prompt Caching during Dev Day; it keeps frequently requested contexts and prompts in memory so the model does not have to reprocess them, cutting the cost of repeated inputs. Lowering input prices could encourage more interested developers to connect to the API.
OpenAI is not the only company to roll out Prompt Caching. Anthropic launched prompt caching for Claude 3.5 Sonnet in August.