OpenAI’s new o3 artificial intelligence model achieved a breakthrough high score on the ARC Challenge, a prestigious test of AI reasoning, leading some AI fans to speculate that o3 has achieved artificial general intelligence (AGI). But while the ARC Challenge organizers described o3’s achievement as a significant milestone, they cautioned that it has not won the competition’s grand prize, and warned that it is just one step on the road to AGI, a term referring to hypothetical future AI with human-like intelligence.
The o3 model is the latest in a line of AI releases following the large language models that power ChatGPT. “This is a surprising and significant step-up in AI capabilities, demonstrating new task-adaptive capabilities never before seen in GPT-family models,” said François Chollet, an engineer at Google and the main creator of the ARC Challenge, in a blog post.
What did OpenAI’s o3 model actually do?
Designed by Chollet in 2019, the Abstraction and Reasoning Corpus (ARC) challenge tests how well an AI can find the right pattern connecting pairs of colored grids. These visual puzzles are intended to require a form of general intelligence, in which the AI possesses basic reasoning abilities. However, with enough computing power applied to a puzzle, even non-reasoning programs can solve it through brute force. To prevent this, the competition requires official score submissions to stay within certain limits on computing power.
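To make the task format concrete, here is a toy ARC-style puzzle sketched in Python. The puzzle, the grids and the hidden rule (a simple horizontal mirror) are all invented for illustration; real ARC tasks are far more varied. A solver studies a few input/output example pairs, infers the transformation, and applies it to a held-out test input.

```python
# Toy illustration of an ARC-style task (hypothetical puzzle, not a real ARC item).
# Grids are lists of rows; each cell holds an integer color code (0 = black, etc.).

def mirror_horizontal(grid):
    """The hidden rule for this toy task: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# Example input/output pairs the solver would study to infer the rule.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 5, 0]],      [[0, 5, 5]]),
]

# Check that the candidate rule explains every training pair...
assert all(mirror_horizontal(inp) == out for inp, out in train_pairs)

# ...then apply it to the held-out test input.
test_input = [[7, 0, 4]]
print(mirror_horizontal(test_input))  # [[4, 0, 7]]
```

The point of the benchmark is that each puzzle uses a different hidden rule, so a solver cannot memorise its way through; it has to infer a new transformation from a handful of examples every time.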
OpenAI’s newly announced o3 model, scheduled for release in early 2025, achieved a breakthrough official score of 75.7% on the ARC Challenge’s “semi-private” test, which is used to rank competitors on a public leaderboard. The computational cost of achieving this was approximately $20 per visual puzzle task, meeting the contest limit of less than $10,000 in total. However, the more difficult “private” test used to determine grand prize winners has a much stricter computing power limit, equating to spending just 10 cents per task, which OpenAI did not meet.
The o3 model also achieved an unofficial score of 87.5% by applying approximately 172 times more computing power than it used for the official score. For comparison, the typical human score is 84%, and a score of 85% is enough to win the ARC Challenge’s $600,000 grand prize, provided the model can keep computational costs within the required limits.
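The contest arithmetic reported above can be checked back-of-envelope. The evaluation set size is assumed here for illustration, and the per-task costs are the approximate figures from this article, not exact numbers:

```python
# Rough check of the contest limits described above.
# Task count and per-task costs are approximations, not official figures.

tasks = 100                      # assumed size of the evaluation set
official_cost_per_task = 20.00   # ~$20 per task for o3's official 75.7% score
budget_cap = 10_000.00           # contest limit on total compute spend
grand_prize_per_task = 0.10      # ~10 cents per task on the grand-prize track

total_official = tasks * official_cost_per_task
print(total_official <= budget_cap)                   # True: ~$2,000 is under the $10,000 cap
print(official_cost_per_task / grand_prize_per_task)  # 200.0: ~200x the grand-prize budget per task

# The unofficial 87.5% run used ~172x the official compute, so per-task cost scales to:
print(official_cost_per_task * 172)                   # 3440.0: thousands of dollars per task
```

This also squares with the next paragraph: multiplying the official ~$20-per-task cost by roughly 172 lands in the thousands of dollars per task.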
But to reach that unofficial score, o3 incurred costs in the thousands of dollars for each task it solved. OpenAI asked the challenge organizers not to disclose the exact computing costs.
Does this o3 achievement mean that AGI has been reached?
No, the ARC Challenge organizers specifically stated that beating this competitive benchmark is not considered an indicator of achieving AGI.
Mike Knoop, an ARC Challenge organizer at software company Zapier, said in a post on X that the o3 model still failed to solve more than 100 of the visual puzzle tasks, despite OpenAI applying a very large amount of computing power to its unofficial score.
In a post on Bluesky, Melanie Mitchell at the Santa Fe Institute in New Mexico said of o3’s progress on the ARC benchmark: “I think solving these tasks with brute-force computing defeats the purpose.”
“The new model is very impressive and represents a big milestone towards AGI, but I don’t think it is AGI; there are still quite a few very easy [ARC Challenge] tasks that o3 cannot solve,” Chollet said in another post on X.
But Chollet also explained how we might recognize an AGI with human-level intelligence. “You will know that AGI is here when it becomes impossible to create tasks that are easy for regular humans but difficult for AI,” he said in the blog post.
Thomas Dietterich at Oregon State University suggests another way to recognize AGI, by comparing AI against proposed architectures of human cognition. “These architectures claim to contain all the functional components necessary for human cognition,” he says. “By this measure, commercial AI systems lack episodic memory, planning, logical reasoning and, most importantly, metacognition.”
So what does a high score on o3 really mean?
The o3 model’s high score comes as the technology industry and AI researchers have been reckoning with a slower pace of progress in AI models in 2024, compared with the initial explosion of progress in 2023.
Although it didn’t win the ARC Challenge, o3’s high scores indicate that AI models could beat the competition benchmark in the near future. Beyond the unofficial top score, Chollet says many official low-compute submissions have already scored above 81% on the private evaluation test set.
Dietterich also calls this “a very impressive leap forward in performance”. But he cautions that it is impossible to assess how impressive the high scores are without knowing more about how OpenAI’s o1 and o3 models work. For example, if o3 was able to practice on ARC problems in advance, the achievement would have come much more easily. “To understand the full implications of this, we will have to wait for an open-source replication,” says Dietterich.
The ARC Challenge organizers are already planning to launch a second, more difficult benchmark test in 2025. They will also continue running the ARC Prize 2025 challenge until someone achieves the target score and open-sources their solution.