Can Claude Code cold-solve ARC-AGI-3?

The agent sees pixels and chooses which button to press. Zero task-specific information. Each turn has a fixed budget. In one hardened playthrough of the public set, Claude Code with Fable 5 won — games outright and cleared —% of all levels. The identical harness with Opus 4.8 won —. Every run is logged in full. Source code launching soon.

Premise

The model is not the agent.

Same brain, different body, different score

A language model chooses the next token based on what it has learned and what it can see. An agent acts. That only happens through software that controls what the model can see, asks it what to do next, records what happened, and feeds the result back.

The harness is the model's body. It's the software that turns prediction into action. The agent is the model in that loop.
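
The loop described above can be sketched in a few lines. This is a minimal illustration, not the project's harness: ToyEnv, run_agent, and always_press_one are hypothetical names, and the "game" is a made-up stand-in where pressing button 1 three times clears the level. The point is only the shape: the harness shows a frame, asks for a button, applies it, records the result, and repeats.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = List[List[int]]  # a grid of pixel values


@dataclass
class ToyEnv:
    """Hypothetical stand-in game: press button 1 three times to clear it."""
    presses: int = 0

    def observe(self) -> Frame:
        return [[self.presses]]

    def step(self, button: int) -> bool:
        if button == 1:
            self.presses += 1
        return self.presses >= 3  # True once the level is cleared


def run_agent(env, policy: Callable[[Frame, list], int], max_actions: int):
    """Harness loop: show a frame, ask for an action, apply it, record it."""
    history: List[Tuple[Frame, int]] = []
    for _ in range(max_actions):
        frame = env.observe()            # what the model is allowed to see
        button = policy(frame, history)  # "what do you do next?"
        done = env.step(button)          # prediction turned into action
        history.append((frame, button))  # record what happened
        if done:
            return True, history
    return False, history


def always_press_one(frame: Frame, history: list) -> int:
    """Placeholder for the model: a fixed policy, purely for illustration."""
    return 1


won, log = run_agent(ToyEnv(), always_press_one, max_actions=10)
short_won, short_log = run_agent(ToyEnv(), always_press_one, max_actions=2)
```

Swapping always_press_one for a call to a model changes the brain. Changing run_agent (what goes into the frame, what goes into the history, how large max_actions is) changes the body.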

[Image: outline drawing of the ghostwriter mascot, a smiling face ringed with tentacles]

Harness ablations

Strip a component. Does it still solve?

Each configuration removes one piece from the harness · cells are repeat runs
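
A leave-one-out ablation grid of this kind could be generated as follows. This is a sketch under assumptions: the component names are placeholders chosen for illustration, not the harness's real component list, and solve_rate simply averages repeat runs in one cell.

```python
from typing import Dict, List


def leave_one_out(components: List[str]) -> Dict[str, List[str]]:
    """Build the full configuration plus one configuration per removed piece."""
    configs = {"full": list(components)}
    for removed in components:
        configs[f"no_{removed}"] = [c for c in components if c != removed]
    return configs


def solve_rate(repeat_runs: List[bool]) -> float:
    """Fraction of repeat runs in one cell that solved the game."""
    return sum(repeat_runs) / len(repeat_runs) if repeat_runs else 0.0


# Hypothetical component names, used only to show the shape of the grid.
components = ["scratch_files", "supervisor", "action_budget"]
configs = leave_one_out(components)
example_rate = solve_rate([True, False, True, True])
```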

Model sweep

Same harness, different models.

How many games each one cracks

Depth

How deep a single playthrough went.

Game by game · cell = actions to clear a level

Legend (levels × games): game won · died here · cut off · not reached · best run reached

Case study

One run, start to finish.

Setup

What the model gets vs what it doesn’t.

Every instruction is generic

What Claude was given

Pixels in, button index out: the current frame, a fixed set of buttons, and scratch files (theory.md, journal.md) that persist across turns — under a per-run action budget.
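
As a rough sketch of that interface: the file names theory.md and journal.md come from the description above, while Scratchpad, ActionBudget, and take_turn are hypothetical names and the button choice is a placeholder for the model's decision. The sketch shows notes surviving between turns and the budget cutting the run off.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


class Scratchpad:
    """Notes files that persist between turns."""

    def __init__(self, root: Path):
        self.root = root
        for name in ("theory.md", "journal.md"):
            (root / name).touch()

    def read(self, name: str) -> str:
        return (self.root / name).read_text()

    def append(self, name: str, text: str) -> None:
        with (self.root / name).open("a") as f:
            f.write(text + "\n")


class ActionBudget:
    """Counts committed actions against a fixed per-run limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def spend(self) -> bool:
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def take_turn(frame: List[List[int]], n_buttons: int,
              pad: Scratchpad, budget: ActionBudget) -> Optional[int]:
    """One turn: return a button index, or None once the budget is spent."""
    if not budget.spend():
        return None
    turn = budget.used
    button = turn % n_buttons  # placeholder for the model's choice
    pad.append("journal.md", f"turn {turn}: pressed {button}")
    return button


with tempfile.TemporaryDirectory() as d:
    pad = Scratchpad(Path(d))
    budget = ActionBudget(limit=2)
    choices = [take_turn([[0]], 4, pad, budget) for _ in range(3)]
    journal = pad.read("journal.md")
    theory = pad.read("theory.md")
```

Nothing in this interface names the game, labels a button, or states a goal. The only channel out is an integer.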

What Claude was not given

Any description of the game, the action buttons, or the objective. No reward signal in natural language. No walkthrough, screenshot, video, or documentation about which game it was.

What the harness does between sessions

Long runs are supervised. Sessions are checkpointed and restarted to survive timeouts and dropped connections, and a session that spends its whole time analysing without committing an action is told to act rather than plan forever. These interventions shape effort — when the model runs, and that it acts — never content. The supervisor supplies no mechanics, hints, or solutions, and never edits or retries a committed action. Every intervention is disclosed with the harness source.
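
One way such a supervisor could look, as a sketch only: supervise, save_checkpoint, load_checkpoint, the NUDGE text, and the checkpoint layout are all assumptions made for illustration, and the fake session exists purely to exercise the loop. The sketch restarts after a timeout, sends a generic nudge after a session that commits no action, logs every intervention, and never modifies actions already committed.

```python
import json
import tempfile
from pathlib import Path

NUDGE = "Commit an action now."  # generic: shapes effort, carries no game content


def save_checkpoint(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state))


def load_checkpoint(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text())
    return {"actions": [], "interventions": []}


def supervise(session, path: Path, max_sessions: int) -> dict:
    """Run sessions until one finishes; restart on timeouts, nudge on stalls."""
    state = load_checkpoint(path)
    nudge = None
    for i in range(max_sessions):
        before = len(state["actions"])
        try:
            finished = session(state, nudge)
        except TimeoutError:
            # Restart: committed actions are kept as they are, never retried.
            state["interventions"].append({"session": i, "kind": "restart"})
            save_checkpoint(path, state)
            nudge = None
            continue
        if len(state["actions"]) == before:
            # The session only analysed; tell the next one to act.
            state["interventions"].append({"session": i, "kind": "nudge"})
            nudge = NUDGE
        else:
            nudge = None
        save_checkpoint(path, state)
        if finished:
            break
    return state


def make_fake_session():
    """Hypothetical session: times out, then stalls, then acts."""
    calls = {"n": 0}

    def session(state: dict, nudge) -> bool:
        calls["n"] += 1
        if calls["n"] == 1:
            raise TimeoutError("dropped connection")
        if calls["n"] == 2:
            return False  # planned without committing an action
        state["actions"].append(3)
        return True

    return session


with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "checkpoint.json"
    final = supervise(make_fake_session(), ckpt, max_sessions=5)
    resumed = load_checkpoint(ckpt)
```

Keeping the interventions list inside the checkpoint means the disclosure record travels with the run itself.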

Pick a run

View any run.

Example playthroughs

Open data

Open data and code.

Open to inspection

Every frame, prompt, tool call, reasoning trace, and experimental mistake is here for audit. The harness, runs, and audit code live in one repository. Every number on this page is computed from the data. Each run links to its own timeline.
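
As an illustration of what "computed from the data" can mean for the headline figures, here is a sketch over a made-up record format. GameResult, its fields, and the rule that a game counts as won when every level is cleared are assumptions, and the demo values are invented, not results from any run.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GameResult:
    """Hypothetical per-game record derived from a run's logs."""
    game: str
    levels_total: int
    levels_cleared: int

    @property
    def won(self) -> bool:
        # Assumption: a game is won outright when all its levels are cleared.
        return self.levels_cleared == self.levels_total


def summarize(results: List[GameResult]) -> Dict[str, float]:
    """Games won outright and percentage of all levels cleared."""
    total = sum(r.levels_total for r in results)
    cleared = sum(r.levels_cleared for r in results)
    return {
        "games": len(results),
        "games_won": sum(r.won for r in results),
        "levels_cleared_pct": 100.0 * cleared / total if total else 0.0,
    }


# Invented values, only to show the arithmetic.
demo = [GameResult("game_a", 4, 4), GameResult("game_b", 6, 3)]
summary = summarize(demo)
```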