I’m at OpenAI’s developer conference, DevDay, today in San Francisco. Here’s what I saw.
The big news is that the company launched a Realtime API that promises to allow anyone to build functionality similar to ChatGPT’s Advanced Voice Mode within their own app. Paired with their new model o1, released a few weeks ago, OpenAI is creating a new way to build software.
o1 can prototype anything you have in your head in minutes rather than months. And the Realtime API enables developers to build software with a novel interface—life-like voice conversations—that was previously only the domain of science fiction.
OpenAI also announced the following:
- An increase in API rate limits on the o1 model to match GPT-4o’s (10,000 requests per minute)
- A reduction of the price of GPT-4o API calls with automatic prompt caching, making repeated API calls 50 percent cheaper with no extra developer effort
- A multi-modal fine-tuning API that allows developers to fine-tune GPT-4o with images in addition to text
- A tripling of the number of active apps on the OpenAI platform from last year to this year, with 3 million active developers
Let’s get into the details.
o1 victory lap
OpenAI released its new reasoning model o1 two and a half weeks ago, and the company’s excitement about it was palpable in the room. OpenAI’s head of product for the API, Olivier Godement, described it as a new family of models, distinct from GPT-4o, which is why they reset the number on the model back to one.
At the beginning of the AI wave there was a lot of talk about what artificial general intelligence would look like: Would it be one model to rule them all—GPT-5 for all comers—or would there be different models for different purposes? Today OpenAI told developers that they would be investing heavily in both next-generation GPT-4o and o1-type models. The company is betting on a diversity of models for different use cases as the way forward.
o1 models excel at reasoning—which OpenAI defines as being able to think in a chain of thought format—which makes them better at tasks like programming, but slower and more expensive. These models will be used for tasks that require more advanced reasoning, but they won’t become the default model choice because most prompts won’t require it.
The increase in programming power was on display today. Romain Huet, OpenAI’s head of developer relations, did a live demo where he used o1 to build an iPhone app end-to-end with a single prompt in less than 30 seconds. He also demonstrated building a web app to control a drone that he brought on stage, and used it to pilot the drone for the audience.
It would’ve been possible to do these demos with previous GPT models, but they would’ve taken much longer to build, and probably wouldn’t have been suitable for a live audience.
With o1, OpenAI is previewing a future where you can go from idea to app in a minute or two.
Realtime API with speech-to-speech
The most impressive thing OpenAI launched, by far, is its Realtime API. As I mentioned earlier, it allows developers to build Advanced Voice Mode-type capabilities into their own apps with a speech-to-speech API.
Developers will be able to send recorded audio into OpenAI’s servers and get back a recorded response, a transcript, and function calls, all in real time. Rollout for the Realtime API starts today in public beta (here’s the documentation if you want to use it). They also teased that more modalities, like video, are coming to the Realtime API.
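Under the hood, the Realtime API is event-based JSON over a WebSocket: the client streams base64-encoded audio chunks up and asks the server to respond. Here’s a minimal sketch of the client-side messages, with event names taken from the public beta documentation; treat the details as illustrative rather than authoritative:

```python
import base64
import json

def audio_append_event(pcm_bytes: bytes) -> str:
    """Build an input_audio_buffer.append event: a chunk of the user's
    audio, base64-encoded, streamed up to the server."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

def response_create_event() -> str:
    """Ask the server to generate a response, both spoken audio and a
    text transcript."""
    return json.dumps({
        "type": "response.create",
        "response": {"modalities": ["audio", "text"]},
    })

# In a real app you would send these over the WebSocket connection and
# handle the audio/transcript events streaming back.
event = json.loads(audio_append_event(b"\x00\x01" * 160))
```

The same connection carries function-call events back from the model, which is what makes voice agents (rather than just voice chat) possible.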
The Realtime API is expensive: According to OpenAI, it will cost approximately $0.06 per minute of audio input and $0.24 per minute of audio output. For a conversation split evenly between input and output, that works out to $0.15 per minute. That’s pricier than the ElevenLabs speech-to-speech offering, which comes to about $0.11 per minute, though ElevenLabs isn’t pay-per-use; you have to buy a certain number of minutes per month.
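The per-minute figure above falls out of simple arithmetic on the quoted prices. A quick back-of-the-envelope estimator, using only the numbers OpenAI announced:

```python
# Quoted Realtime API audio prices (per minute, in dollars).
INPUT_PER_MIN = 0.06
OUTPUT_PER_MIN = 0.24

def realtime_cost(input_minutes: float, output_minutes: float) -> float:
    """Total audio cost in dollars for a session."""
    return input_minutes * INPUT_PER_MIN + output_minutes * OUTPUT_PER_MIN

# One minute of conversation split evenly between user and model:
per_minute = realtime_cost(0.5, 0.5)  # ≈ $0.15
```

A session where the model does most of the talking skews toward the $0.24 output rate, so real costs depend heavily on the shape of the conversation.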
Real-time voice opens up a ton of new use cases—like better reading companions and more immersive language tutoring—and I’ll have new experiments to show you over the next several weeks.
Fine-tuning tools
Because OpenAI is taking seriously the idea that multiple models are going to be better than one big one, the company is extending its efforts to allow companies to customize their own versions of GPT-4o. At DevDay, executives said that they envision a future where every company has their own fine-tuned model—one that’s refined for its particular use case and has access to its data.
To that end, they launched a few new things:
An image fine-tuning API
Anyone can use their image data to fine-tune GPT-4o. For example, if you work in healthcare and you want to fine-tune GPT-4o’s ability to read and label MRIs, you can use this API to do that.
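Fine-tuning data for this is JSONL in the chat format, with images embedded in the user message. A sketch of what one training line might look like for the hypothetical MRI-labeling example above (the record shape follows OpenAI’s fine-tuning docs; the system prompt and labels here are made up):

```python
import json

def image_training_example(image_url: str, label: str) -> str:
    """One JSONL line in the chat fine-tuning format, with an image
    in the user turn and the desired label as the assistant turn."""
    record = {
        "messages": [
            {"role": "system", "content": "You label MRI scans."},
            {"role": "user", "content": [
                {"type": "text", "text": "Label this scan."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ]},
            {"role": "assistant", "content": label},
        ]
    }
    return json.dumps(record)

line = image_training_example("https://example.com/scan-001.jpg", "normal")
```

You assemble a file of such lines and upload it as the training set, exactly as with text-only fine-tuning.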
Model distillation tools
OpenAI is also releasing two tools to enable better model distillation—the process of creating a smaller, faster, cheaper version of a foundation model built for specific use cases.
To make distillation easier, they added the ability to record previous API interactions and use them as data for fine-tuning in their developer playground. They also added an Evals tool to their playground to allow developers to evaluate the performance of their fine-tuned models. You can read more about that here.
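The core loop here is simple: record what the big model said, fine-tune the small model on it, then measure how close the student gets. As a minimal stand-in for what the Evals tool does (the real tool supports much richer graders than exact match), a distillation check might look like:

```python
def exact_match_eval(examples, student) -> float:
    """Fraction of recorded teacher answers the student model reproduces
    exactly. `examples` is a list of (prompt, teacher_answer) pairs;
    `student` is any callable that maps a prompt to an answer."""
    hits = sum(1 for prompt, target in examples if student(prompt) == target)
    return hits / len(examples)

# Toy recorded interactions and a toy "student" standing in for a
# fine-tuned small model:
logged = [("2+2?", "4"), ("Capital of France?", "Paris")]
score = exact_match_eval(logged, lambda p: "4" if "2+2" in p else "Lyon")
```

In practice the recorded API interactions play the role of `logged`, and the student is your distilled model behind an API call.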
Prompt caching makes repeated API calls 50 percent cheaper
OpenAI also launched a new prompt caching feature that detects repeated API calls and returns previously generated responses. It does this automatically starting today, making many API calls 50 percent cheaper with no extra developer work.
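The savings come from the repeated prefix of a prompt, the system message and shared context that stay the same across calls, being billed at half price once cached. A rough estimator, assuming cached input tokens cost 50 percent of the normal rate and using a hypothetical $2.50 per million input tokens:

```python
def input_cost(total_tokens: int, cached_tokens: int,
               price_per_token: float) -> float:
    """Input cost with prompt caching: cached (repeated-prefix) tokens
    are billed at half the normal per-token price."""
    uncached = total_tokens - cached_tokens
    return uncached * price_per_token + cached_tokens * price_per_token * 0.5

price = 2.50 / 1_000_000  # hypothetical per-token input price

first_call = input_cost(10_000, 0, price)       # no cache hit yet
repeat_call = input_cost(10_000, 8_000, price)  # 8k prefix tokens cached
```

The bigger and more stable your shared prefix, the closer you get to the full 50 percent discount, with no code changes required on your end.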
This feature is a continuation of OpenAI’s trend of racing to make using its API cheaper and cheaper. It’s great for developers, but it also creates an interesting dynamic with its biggest partner, Microsoft. I heard from DevDay attendees that Microsoft has been pushing large enterprises to commit up front to buy a certain amount of GPT-4 API calls in order to guarantee capacity. One wonders how Microsoft—and its customers who have already committed—feel about these price reductions.
What comes next?
I’ll be at DevDay for the rest of the day watching demos and recording videos. I’ll have a YouTube video up later with a more in-depth look at everything I saw, and I’ll publish a longer reflection later in the week with some more analysis.
The first big takeaway for me is that OpenAI is leaning into the race to build different kinds of models for different use cases. The company believes that the most effective applications are going to string together multiple models rather than use one for everything. For example, developers might use one model that is good at reasoning—like o1—and another model that’s good at, say, handling long context or image prompts—like GPT-4o—to create a cohesive experience for users.
The second big takeaway is that OpenAI believes that o1 is an important step toward agents that can autonomously do work without significant user input. Agents have long been one of the sexiest AI applications, but previous GPT models were likely to get off track if they were left to try to figure out a task by themselves. o1, because of its ability to reflect on its own thought processes and plan next steps, is a key pillar in making agents that are actually autonomous.
The third, and biggest, takeaway is just how much technology is now available for developers to build amazing experiences for users. It’s easy to forget that just a few years ago, none of the things that were shown today were possible, or even on anyone’s radar. Today, a single developer making an app in their spare time can build things that entire teams of developers wouldn’t have been capable of previously.
Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn, and Every on X at @every and on LinkedIn.