The line separating human intelligence from artificial intelligence just got narrower.
OpenAI on Thursday revealed o1, the first in a new series of AI models that are “designed to spend more time thinking before they respond,” the company said in a blog post.
The new models can work through complex tasks and, compared with previous models, solve more difficult problems in science, coding, and math. In essence, they think a little more like humans than existing AI chatbots do.
While previous iterations of OpenAI’s models have excelled on standardized tests from the SAT to the Uniform Bar Examination, the company says that o1 goes a step further. It performs “similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology.”
For example, it beat GPT-4o — a multimodal model OpenAI unveiled in May — in the qualifying exam for the International Mathematics Olympiad by a long shot. GPT-4o only correctly solved 13% of the exam’s problems, while o1 scored 83%, the company said.
The sharp surge in o1’s reasoning capabilities comes, in part, from a prompting technique known as “chain of thought.” OpenAI said o1 “learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.”
That’s not to say there aren’t some tradeoffs compared to earlier models. OpenAI noted that while human testers preferred o1’s responses in reasoning-heavy categories like data analysis, coding, and math, GPT-4o still won out in natural language tasks like personal writing.
OpenAI’s primary mission has long been to create artificial general intelligence, or AGI, a still hypothetical form of AI that mimics human capabilities. Over the summer, while o1 was still in development, the company unveiled a new five-level classification system for tracking its progress toward that goal. Company executives reportedly told employees that o1 was nearing level two, which the company identifies as “reasoners” with human-level problem-solving.
Ethan Mollick, a professor at the University of Pennsylvania’s Wharton School who has had access to o1 for over a month, said the model’s gains are perhaps best illustrated by how it solves crossword puzzles. Crossword puzzles are typically difficult for large language models to solve because “they require iterative solving: trying and rejecting many answers that all affect each other,” Mollick wrote in a post on his Substack. Most large language models “can only add a token/word at a time to their answer.”
But when Mollick asked o1 to solve a crossword puzzle, the model thought for a “full 108 seconds” before responding. He said that its thoughts were both “illuminating” and “pretty impressive,” even if they weren’t fully correct.
Other AI experts, however, are less convinced.
Gary Marcus, a New York University professor of cognitive science, told Business Insider that the model is “impressive engineering” but not a giant leap. “I am sure it will be hyped to the sky, as usual, but it’s definitely not close to AGI,” he said.
Since OpenAI unveiled GPT-4 last year, it’s been releasing successive iterations in its quest to invent AGI. In April, GPT-4 Turbo was made available to paid subscribers. One update included the ability to generate responses that are “more conversational.”
The company announced in July that it’s testing an AI search product called SearchGPT with a limited group of users.