Measuring Agents With Interactive Evaluations
Based on OpenAI's video on YouTube. If you like this content, support the original creators by watching, liking, and subscribing to their channel.
ARC AGI 3 is designed to evaluate generalization and learning efficiency through interactive, instruction-free environments rather than static question answering.
Briefing
Frontier AI progress needs more than "right answers" in fixed settings; it requires interactive benchmarks that measure how efficiently an agent learns and acts in novel environments. The ARC Prize Foundation frames intelligence as skill-acquisition efficiency (essentially, how well a system can learn new things) and argues that generalization cannot be judged by static question-and-answer tests. Instead, evaluation should mirror how intelligence unfolds in the real world: through perception, feedback, planning, and action over time.
The ARC Prize Foundation's approach centers on interactive agents and a new benchmark series, ARC AGI 3: 150 open-sourced video-game environments designed to test adaptation to unseen situations. Each game uses entirely new mechanics (not just new levels of the same template) and provides no natural-language instructions. Players must infer the environment's goals and work out how to reach them through exploration and trial. The benchmark is split into a public test set and a private evaluation set: researchers can familiarize themselves with the interface and format publicly, but final scoring relies on private games that neither developers nor models have seen.
A key design goal is to include problems that are easy for humans but hard for AI. Humans serve as the "only proof point" of general intelligence, so a game is accepted only if recruited members of the general public can solve it on a first run, meaning they have never seen it before. Rather than tracking only whether a level is completed, ARC AGI 3 records how many actions (turns) each task takes. That metric becomes "action efficiency": a measure of how directly an agent converts environmental information into progress toward a goal.
This action-efficiency lens is presented as a way to quantify not just capability but learning efficiency. In the Pokémon example, GPT-5's progress is plotted against the number of actions needed to hit milestones; models that require more actions show a shallower efficiency slope. The same logic applies to ARC AGI 3: human first-run data establishes a baseline for how quickly general intelligence can operate, while model data reveals whether systems reach goals through efficient learning or brute-force exploration.
The benchmark also aims to expose a "human-AI gap." Even when a model completes tasks, it may take far more actions than humans, indicating weaker information-to-value conversion. In one agentic game example (LS20), GPT-5 spends many actions without meaningful progress, while humans complete the level in far fewer turns: evidence of a gap between action volume and goal-directed effectiveness.
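As a concrete reading of the metric, the sketch below computes per-level action efficiency and an aggregate human-AI gap from first-run action counts. The formulas and names (action_efficiency, human_ai_gap) are illustrative assumptions based on the video's description, not the foundation's official scoring code.

```python
# Assumed definitions: efficiency is the ratio of the human first-run action
# count to the agent's action count on the same level; the gap aggregates
# the shortfall across completed levels.

def action_efficiency(human_actions: int, agent_actions: int) -> float:
    """1.0 means human-level efficiency; values below 1.0 mean the agent
    needed more actions than the human baseline to finish the level."""
    return human_actions / agent_actions

def human_ai_gap(human_actions: list[int], agent_actions: list[int]) -> float:
    """Average per-level efficiency shortfall across completed levels."""
    ratios = [action_efficiency(h, a) for h, a in zip(human_actions, agent_actions)]
    return 1.0 - sum(ratios) / len(ratios)

# Example: humans solve three levels in 12, 30, and 18 actions; the agent
# needs 95, 240, and 60. Coverage is 100%, but the gap is still large.
print(human_ai_gap([12, 30, 18], [95, 240, 60]))  # ~0.82
```

This is why a model can clear every level yet still score poorly: coverage and efficiency are tracked as separate dimensions.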
Scoring for ARC AGI 3 therefore includes both traditional coverage (how many levels and games are solved) and action efficiency. The foundation claims that a system matching or surpassing human-level action efficiency while generalizing to unseen environments would be the strongest evidence yet of general intelligence, though ARC AGI 3 itself is not treated as proof of AGI. The immediate call to action is to play the six preview games at three.arcprize.org and, for agent developers, to use the API to run interactive evaluations with their own systems; a minimal sketch of such a loop follows.
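The real API's endpoints and action space are not specified in the video, so the following is a hypothetical sketch of the interactive loop an agent developer would implement: observe a frame, choose an action, and count turns until the level is done. ToyEnv, ACTIONS, and the step() signature are all assumptions for illustration, not the official ARC AGI 3 client.

```python
import random

ACTIONS = ["up", "down", "left", "right", "interact"]  # assumed action set

class ToyEnv:
    """Toy stand-in for a game session. The goal (reach cell 5) is never
    told to the agent, matching the benchmark's instruction-free design."""
    def reset(self):
        self.pos = 0
        return self.pos  # the "frame" is just the current position here

    def step(self, action):
        self.pos += 1 if action == "right" else 0
        done = self.pos >= 5
        return self.pos, float(done), done  # frame, reward signal, done flag

def run_episode(env, choose_action, max_turns=1000):
    """Play until solved or the turn budget runs out; return actions used.
    This turn count is exactly what feeds the action-efficiency metric."""
    frame = env.reset()
    for turn in range(1, max_turns + 1):
        frame, reward, done = env.step(choose_action(frame))
        if done:
            return turn
    return max_turns

# Brute-force exploration wins eventually but burns many turns; a directed
# policy solves the same level in 5.
random_turns = run_episode(ToyEnv(), lambda frame: random.choice(ACTIONS))
directed_turns = run_episode(ToyEnv(), lambda frame: "right")
print(random_turns, directed_turns)  # e.g. 23 vs 5
```

Both agents achieve full coverage on this toy level; only the turn counts distinguish efficient learning from brute force, which is the point of the metric.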
Cornell Notes
ARC AGI 3 is built to measure generalization and learning efficiency using interactive game environments rather than static benchmarks. Intelligence is defined as skill-acquisition efficiency, so evaluation tracks not only whether an agent reaches a goal but how many actions (turns) it takes to do so: "action efficiency." Games have novel mechanics and no instructions, and are split into public format exposure and private evaluation on unseen instances. Human first-run performance (from recruited members of the general public) sets the baseline for action efficiency, enabling a "human-AI gap" analysis. The benchmark is positioned as strong evidence of generalization, with the most meaningful threshold being matching or surpassing human-level action efficiency on unseen environments.
Why does the benchmark focus on interactive evaluations instead of static question answering?
How does ARC AGI 3 operationalize “generalization” during evaluation?
What does “action efficiency” measure, and why is it central to the benchmark?
What does “easy for humans, hard for AI” mean in practice for game selection?
How does the benchmark detect a "human-AI gap"?
What would count as the strongest evidence of general intelligence under this framework?
Review Questions
- How does ARC AGI 3’s public/private split help separate learning the interface from learning the actual evaluation instances?
- Why might a model that completes many levels still fail the benchmark’s deeper notion of intelligence?
- What role do human first-run action counts play in defining action efficiency and the human-AI gap?
Key Points
1. ARC AGI 3 is designed to evaluate generalization and learning efficiency through interactive, instruction-free environments rather than static question answering.
2. Intelligence is framed as skill-acquisition efficiency, motivating metrics that reflect how quickly systems learn and act, not just whether they succeed.
3. Action efficiency, measured as the number of actions (turns) needed to complete tasks, adds a new dimension beyond accuracy.
4. Games are built with novel mechanics and are split into public format exposure and private evaluation on unseen instances to test true generalization.
5. Game selection prioritizes puzzles that general-public participants can solve on a first run, ensuring a human baseline for efficient intelligence.
6. The benchmark's "human-AI gap" highlights cases where models use many more actions than humans to reach the same goals, suggesting brute-force rather than efficient learning.
7. Even strong results are treated as evidence of generalization rather than proof of AGI, with the key signal being human-level or better action efficiency on unseen environments.