Can Claude Code cold-solve ARC-AGI-3?
The agent sees pixels and chooses which button to press. Zero task-specific information. Each turn has a fixed budget. In one hardened playthrough of the public set, Claude Code won — games outright and cleared —% of all levels (up from 0% of games and ~1% of levels for the same models with no harness). Every run is logged in full. Source code launching soon.
The model is not the agent.
A language model chooses the next token based on what it has learned and what it can see. An agent acts. That only happens through software that controls what the model can see, asks it what to do next, records what happened, and feeds the result back.
The harness is the model's body. It's the software that turns prediction into action. The agent is the model in that loop.
Strip a component. Does it still solve?
Same harness, different models.
How deep a single playthrough went.
One run, start to finish.
What the model gets vs what it doesn’t.
What Claude was given
Pixels in, button index out: the current frame, a fixed set of
buttons, and scratch files (theory.md, journal.md)
that persist across turns — under a per-run action budget.
What Claude was not given
Any description of the game, the action buttons, or the objective. No reward signal in natural language. No walkthrough, screenshot, video, or documentation about which game it was.
View any run.
Open data and code.
Every frame, prompt, tool call, reasoning trace, and experimental mistake is here for audit. The harness, runs, and audit code live in one repository. Every number on this page is computed from the data. Each run links to its own timeline.