Skip to content


ARC-AGI-3 = The New AI Intelligence Test Where Every Model Basically Failed

ARC-AGI-3 = The New AI Intelligence Test Where Every Model Basically Failed

Via The Neuron
“You know those moments where you download a new app, poke at it for two minutes, and just figure it out? No tutorial, no instructions?
Turns out AI can’t do that. At all.
The ARC Prize Foundation just launched ARC-AGI-3, a new benchmark (standardized test for AI) designed to measure whether AI can learn new things on the fly, the way humans do. Instead of answering trivia or writing code, AI agents explore unfamiliar interactive environments and solve puzzles they’ve never seen before.
“AGI” (which = artificial general intelligence) is the finish line every major AI company is racing toward: AI that handles any new task without special training.
AGI is a major goal. OpenAI just renamed its product division “AGI Deployment.” Jensen Huang said AGI is “already in the room.” OpenAI’s next model, codenamed “Spud,” might be the first they claim as AGI. ARC-AGI-3 is meant to actually test these claims.
Here’s what happened after it launched:
  • Every frontier model scored under 1%. Gemini 3.1 Pro: 0.37%. GPT-5.4: 0.26%. Claude Opus 4.6: 0.25%. Grok 4.2: 0%.
  • 100% of human testers solved every environment on their first try. No instructions, no training.
  • The benchmark tests 135 novel environments (~1,000 levels), measuring how efficiently AI solves them compared to humans.
  • $2M prize competition is live on Kaggle, and you can play the public games yourself.
Why this matters: Not everyone agrees the test is fair. The scoring uses a squared efficiency penalty (if a human takes 10 steps and AI takes 100, the AI scores 1%). AI can’t score higher than humans even when more efficient, but humans do get bonus points when they’re more efficient than AI. Extended-thinking models were excluded. Critic @scaling01 argued the methodology was designed to produce low scores.
ARC founder François Chollet fired back with a deeper point: today’s models only perform well when humans build elaborate scaffolding around them (specific prompts, custom harnesses, thinking tricks). The scaffolding is the human intelligence; the model is just executing it. If it’s truly AGI, there should be no human in the loop.
Notice something interesting? We’re no longer arguing about whether AI is smart. We’re arguing about how to measure smart.
Our take: The real question isn’t whether today’s models pass (eventually they will). It’s whether training on massive datasets w/ the current architecture can ever produce genuine adaptability, or whether something fundamentally different is needed. NYU professor Saining Xie made a provocative case that LLMs are “anti-Bitter Lesson” because they’re built entirely on human-generated knowledge rather than learning from raw experience (more on this in our new podcast ep!).
The models that eventually crack ARC-AGI-3 won’t just be smarter; they’ll be a different kind of smart. The kind that matters if you actually care about AGI. If you don’t, today’s models are perfectly fine tools… they’ll just always require potentially billions in human hand-holding via training. Perhaps that’s for the best.”
  • Pro plugin deactivated or invalid

Posted on: March 27, 2026, 6:36 am Category: Uncategorized

0 Responses

Stay in touch with the conversation, subscribe to the RSS feed for comments on this post.