v0 · public preview · Public ARC-AGI-3 games only · Independent study (non-official benchmark result)
Claude Code  ·  cold-solve

Can Claude Code cold-solve ARC-AGI-3?

The agent sees pixels and chooses which button to press. Zero task-specific information. Each turn has a fixed budget. In one hardened playthrough of the public set, Claude Code won games outright and cleared —% of all levels (up from 0% of games and ~1% of levels for the same models with no harness). Every run is logged in full. Source code launching soon.

Premise

The model is not the agent.

Same brain, different body, different score

A language model chooses the next token based on what it has learned and what it can see. An agent acts. That only happens through software that controls what the model can see, asks it what to do next, records what happened, and feeds the result back.

The harness is the model's body. It's the software that turns prediction into action. The agent is the model in that loop.
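A minimal sketch of that loop, for illustration only. `ToyEnv`, `run_episode`, and `always_press_one` are hypothetical names invented here, and the toy game stands in for a real one; this is not the study's harness.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frame = List[List[int]]          # a grid of pixel values
Log = List[Tuple[Frame, int]]    # (frame shown, button pressed) per turn


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a game: press button 1 three times to win."""
    presses: int = 0

    def frame(self) -> Frame:
        return [[self.presses]]

    def step(self, button: int) -> bool:
        if button == 1:
            self.presses += 1
        return self.presses >= 3  # True once the game is won


def run_episode(env: ToyEnv, policy: Callable[[Frame, Log], int], budget: int):
    """Show the frame, ask for a button, act, record, and feed the log back."""
    log: Log = []
    for _ in range(budget):
        frame = env.frame()          # control what the model can see
        button = policy(frame, log)  # ask it what to do next
        won = env.step(button)       # turn the prediction into an action
        log.append((frame, button))  # record what happened
        if won:
            return True, log
    return False, log                # budget exhausted


def always_press_one(frame: Frame, log: Log) -> int:
    """Placeholder for the model: a real policy would read the frame and log."""
    return 1


won, log = run_episode(ToyEnv(), always_press_one, budget=10)
```

In this sketch the policy is the only part a model would replace. Everything else (what is shown, what is recorded, when the run stops) belongs to the harness.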

Harness ablations

Strip a component. Does it still solve?

Each configuration removes one piece from the harness · cells are repeat runs
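One way to picture a leave-one-out ablation grid, as a sketch under assumptions. Only `scratch_files` is named on this page; `component_b` and `component_c` are placeholders, and `leave_one_out` and `solve_rate` are hypothetical helpers, not the study's code.

```python
from typing import Dict, List


def leave_one_out(components: List[str]) -> Dict[str, List[str]]:
    """Build the full configuration plus one configuration per removed component."""
    configs = {"full": list(components)}
    for removed in components:
        configs[f"no_{removed}"] = [c for c in components if c != removed]
    return configs


def solve_rate(repeat_runs: List[bool]) -> float:
    """Fraction of repeat runs that solved; 0.0 when there are no runs."""
    if not repeat_runs:
        return 0.0
    return sum(repeat_runs) / len(repeat_runs)


# Placeholder component names (assumption): only scratch_files appears in the text.
COMPONENTS = ["scratch_files", "component_b", "component_c"]
configs = leave_one_out(COMPONENTS)
```

Each key in `configs` would correspond to one row of the chart, and each boolean passed to `solve_rate` to one cell.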

Model sweep

Same harness, different models.

How many games each one cracks

Depth

How deep a single playthrough went.

Game by game · cell = actions to clear a level
Grid: levels × games. Legend: game won · died here · cut off · not reached · best run reached
Case study

One run, start to finish.

Setup

What the model gets vs what it doesn’t.

Every instruction is generic

What Claude was given

Pixels in, button index out: the current frame, a fixed set of buttons, and scratch files (theory.md, journal.md) that persist across turns, all under a per-run action budget.
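A rough sketch of how scratch files can carry state from one turn to the next under an action budget. The file names come from the text above; `read_scratch`, `append_journal`, `play`, and the placeholder policy are assumptions made for illustration, not the study's implementation.

```python
import tempfile
from pathlib import Path
from typing import Dict

SCRATCH_FILES = ("theory.md", "journal.md")  # file names taken from the text


def read_scratch(run_dir: Path) -> Dict[str, str]:
    """Load notes left by earlier turns; a missing file reads as empty."""
    notes = {}
    for name in SCRATCH_FILES:
        path = run_dir / name
        notes[name] = path.read_text(encoding="utf-8") if path.exists() else ""
    return notes


def append_journal(run_dir: Path, turn: int, button: int) -> None:
    """Record what happened this turn so the next turn can see it."""
    with (run_dir / "journal.md").open("a", encoding="utf-8") as f:
        f.write(f"turn {turn}: pressed button {button}\n")


def play(run_dir: Path, action_budget: int) -> int:
    """Take turns until the per-run action budget is spent."""
    actions_used = 0
    while actions_used < action_budget:
        notes = read_scratch(run_dir)
        # Placeholder policy (assumption): alternate buttons by journal length.
        button = len(notes["journal.md"].splitlines()) % 2
        append_journal(run_dir, actions_used, button)
        actions_used += 1
    return actions_used


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    used = play(run_dir, action_budget=3)
    journal = read_scratch(run_dir)["journal.md"]
```

The point of the sketch is that nothing survives between turns except what was written to disk, so the notes are the only memory the next turn can rely on.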

What Claude was not given

Any description of the game, the action buttons, or the objective. No reward signal in natural language. No walkthrough, screenshot, video, or documentation about which game it was.

Pick a run

View any run.

Example playthroughs
Open data

Open data and code.

Open to inspection

Every frame, prompt, tool call, reasoning trace, and experimental mistake is here for audit. The harness, runs, and audit code live in one repository. Every number on this page is computed from the data. Each run links to its own timeline.