Karpathy's autonomous experimentation loop — applied to Google Tag Manager containers instead of neural nets. Mutate tags, score structurally, keep the winners.
Tracking engineers spend hours manually tuning GTM containers — adjusting consent mode signals, deduplicating events, aligning parameters with Meta and Google Ads. The work is high-leverage but repetitive. What a human does in an afternoon, a small loop can do a hundred times overnight.
GTM AutoResearch is that loop. It borrows the exact structure of Andrej Karpathy's autoresearch project — fixed training budget, a single metric to beat, an agent that mutates one file — and swaps train.py for a GTM container JSON. Claude Haiku proposes a patch, a nine-dimension scorer judges it, and the system keeps what lifts the score and reverts everything else.
This paper is what we're bringing to Measure Summit 2026: a project about what happens when you treat tracking configuration as an optimization problem an agent can iterate on, and about what the measurement community should build on top of it next.
Five steps, one metric, one file of truth. Every round is a controlled experiment: measure before, mutate, measure after, keep or revert. No operator intuition about which change "felt" better — the score decides.
Karpathy optimizes train.py against val_bpb; we optimize a container JSON against a signal quality score. Both get standardized measurement windows, one file to mutate, one number to beat. Replace the scorer and domain, keep the scaffolding.
| Karpathy autoresearch | GTM AutoResearch |
|---|---|
| train.py | container JSON |
| val_bpb (bits per byte) | signal quality score |
| model architecture mutation | tag / trigger / variable mutation |
| fixed 5-min training budget | fixed 5-min experiment budget |
| validation split | 24-hr signal window |
| program.md → skill file | program.md → SKILL.md |
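
Concretely, one round of the loop fits in a screenful. A minimal sketch, with `scoreContainer` and `proposePatch` as stand-ins for the real scorer and the Claude Haiku call; the names and shapes here are illustrative, not the project's actual API:

```ts
import { readFileSync, writeFileSync } from "node:fs";

type Container = Record<string, unknown>;

// One round: measure before, mutate, measure after, keep or revert.
async function runRound(
  path: string,
  scoreContainer: (c: Container) => number,
  proposePatch: (c: Container) => Promise<Container>,
): Promise<boolean> {
  const before = JSON.parse(readFileSync(path, "utf8")) as Container;
  const baseline = scoreContainer(before);      // measure before

  const candidate = await proposePatch(before); // mutate (LLM patch)
  const after = scoreContainer(candidate);      // measure after

  if (after > baseline) {
    writeFileSync(path, JSON.stringify(candidate, null, 2)); // keep the winner
    return true;
  }
  return false; // revert is free: the file on disk was never touched
}
```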
The scorer is the heart of the system. It reduces an entire GTM container to a single number by answering nine questions a senior tracking engineer would ask in a code review. Structural evaluation runs in seconds on the JSON alone — fast enough to evaluate a hundred experiments in a weekend.
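What "structural" means in practice: each dimension can be a pure function over the parsed export JSON, weighted into a single number. A sketch with hypothetical dimension names and weights, not the actual nine:

```ts
// Hypothetical scorer shape: nine pure checks over the container JSON,
// each scored 0..1, combined by tunable weights. Names are placeholders.
type Container = { tag?: { paused?: boolean }[] }; // GTM export shape, trimmed

type Check = (c: Container) => number;

const dimensions: Record<string, { weight: number; check: Check }> = {
  // Example check: penalize tags left paused in the exported container.
  pausedTags: {
    weight: 1,
    check: (c) => {
      const tags = c.tag ?? [];
      return tags.length === 0 ? 1 : tags.filter((t) => !t.paused).length / tags.length;
    },
  },
  // ...eight more checks (consent coverage, event dedup, param alignment, ...)
};

export function score(c: Container): { byDim: Record<string, number>; overall: number } {
  const byDim: Record<string, number> = {};
  let sum = 0;
  let norm = 0;
  for (const [name, { weight, check }] of Object.entries(dimensions)) {
    byDim[name] = check(c);
    sum += weight * byDim[name];
    norm += weight;
  }
  return { byDim, overall: sum / norm }; // the single number to beat
}
```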
Each round, the scorer surfaces the lowest-scoring dimension to the mutation prompt as a target. Claude Haiku proposes a minimal patch aimed at that weak spot. If the patch lifts the overall score without regressing another dimension beyond a threshold, it's kept. The agent sees the full nine-tuple after each round — the feedback loop is tight and interpretable, not a black box.
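A sketch of that acceptance rule and the target picker, with the regression threshold as an assumed tunable:

```ts
type Scored = { byDim: Record<string, number>; overall: number };

// Next round's mutation target: the weakest dimension in the nine-tuple.
const weakest = (s: Scored): string =>
  Object.entries(s.byDim).sort((a, b) => a[1] - b[1])[0][0];

// Keep a patch only if overall improves AND no single dimension regresses
// past the threshold. EPSILON here is an illustrative value, not the real one.
const EPSILON = 0.05;

function accept(before: Scored, after: Scored): boolean {
  if (after.overall <= before.overall) return false;
  return Object.keys(before.byDim).every(
    (dim) => before.byDim[dim] - (after.byDim[dim] ?? 0) <= EPSILON,
  );
}
```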
Run it overnight. By morning, four artifacts are waiting:
| Deliverable | What you get |
|---|---|
| Staging workspace | GTM workspace with winning config — one-click publish when you're ready |
| Versioned JSON | winning-config.json stored in R2 — rollback to any previous night's best |
| Experiment log | every patch tested, scored, kept or reverted — full audit trail with diffs |
| Playwright QA | each experiment validated in staging preview — tag firing, params, dedup all checked |
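
The Playwright check can be as small as watching the network while the staging page loads. A sketch under assumptions: the URL is hypothetical, the real harness drives GTM's own preview mode, and the dedup assertion looks for exactly one GA4 purchase hit carrying its transaction_id:

```ts
import { chromium } from "playwright";

// Hypothetical staging URL; substitute the page your container runs on.
const STAGING_URL = process.env.STAGING_URL ?? "https://staging.example.com/checkout";

async function qa(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Record every GA4 hit the page sends while it loads.
  const hits: URL[] = [];
  page.on("request", (req) => {
    if (req.url().includes("google-analytics.com/g/collect")) hits.push(new URL(req.url()));
  });

  await page.goto(STAGING_URL, { waitUntil: "networkidle" });

  // Dedup check: exactly one purchase event, carrying its transaction_id.
  const purchases = hits.filter((u) => u.searchParams.get("en") === "purchase");
  if (purchases.length !== 1) throw new Error(`expected 1 purchase hit, got ${purchases.length}`);
  if (!purchases[0].searchParams.get("ep.transaction_id")) throw new Error("purchase hit missing transaction_id");

  await browser.close();
}

qa().catch((err) => { console.error(err); process.exit(1); });
```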
The core loop is the beachhead. Beyond it sits a six-phase pipeline that turns each night's winning configs into training data for a client-specialized model. The loop outputs become JSONL; the JSONL becomes a fine-tune; the fine-tune becomes next week's smarter mutator. This is the compounding move — the flywheel.
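A sketch of that data step, assuming hypothetical field names for the log entries; the output rows use OpenAI's chat fine-tuning JSONL shape:

```ts
import { appendFileSync } from "node:fs";

// Field names here are assumptions about what the experiment log holds.
interface LogEntry {
  weakestDimension: string; // what the mutator was asked to fix
  containerBefore: unknown; // the state the patch was proposed against
  patch: unknown;           // the mutation that was applied
  kept: boolean;            // only winners become training data
}

function toJsonl(entries: LogEntry[], outPath: string): void {
  for (const e of entries) {
    if (!e.kept) continue; // reverted patches stay in the audit trail, not the curriculum
    const row = {
      messages: [
        { role: "system", content: "You improve GTM containers. Output a minimal JSON patch." },
        { role: "user", content: `Weakest dimension: ${e.weakestDimension}\n${JSON.stringify(e.containerBefore)}` },
        { role: "assistant", content: JSON.stringify(e.patch) },
      ],
    };
    appendFileSync(outPath, JSON.stringify(row) + "\n");
  }
}
```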
| | Track A | Track B |
|---|---|---|
| Where | OpenAI cloud | Local Ollama, M3 Ultra |
| Good for | Best quality per token | Privacy + zero cost per token |
| Models | gpt-4o-mini / 4o | llama3 / qwen2.5-coder |

Both tracks publish to a shared model registry with versioned tags, so callers never know which track served them.
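
From the caller's side, that can look like the hypothetical resolver below (OpenClaw's actual interface may differ): callers ask for a versioned tag, and the registry decides which track answers.

```ts
// Hypothetical registry resolver: a versioned tag maps to whichever track
// currently serves that model, so a brain swap is a registry update.
type Track = "openai" | "ollama";
interface Entry { track: Track; model: string; version: string }

const registry = new Map<string, Entry>([
  ["mutator@prod",   { track: "ollama", model: "qwen2.5-coder", version: "v3" }],
  ["mutator@canary", { track: "openai", model: "gpt-4o-mini",   version: "v4" }],
]);

async function complete(tag: string, prompt: string): Promise<string> {
  const e = registry.get(tag);
  if (!e) throw new Error(`unknown model tag: ${tag}`);
  if (e.track === "ollama") {
    // Local track: Ollama's generate endpoint on the default port.
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      body: JSON.stringify({ model: e.model, prompt, stream: false }),
    });
    return ((await res.json()) as { response: string }).response;
  }
  // Cloud track: OpenAI chat completions.
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: JSON.stringify({ model: e.model, messages: [{ role: "user", content: prompt }] }),
  });
  return ((await res.json()) as { choices: { message: { content: string } }[] }).choices[0].message.content;
}
```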
TypeScript-first, no build step, minimal dependencies. The whole loop is small enough to read in one sitting — that's a feature, not an accident. Fewer moving parts means fewer things that break at 3am.
tsx: no compile step, .ts files run directly.

Three things. First, the loop itself: the Karpathy structure adapted for tag managers, with a structural scorer that the measurement community can critique and extend. The nine dimensions were picked from production code reviews; the weights are tunable; every regression is logged. It's ready to inspect.
Second, the scorer as a standard. A container's quality is usually argued in vibes ("this container is clean", "that one's a mess"). Nine structural dimensions give us a shared vocabulary. You don't have to agree with our weights; you just have to point at a specific dimension and argue about it.
Third, the compounding pipeline. The loop outputs become training data. Training data becomes client-specialized brains. Client brains make next week's experiments smarter. We'll show the Phase 1–6 architecture, the dual-track fine-tune strategy, and the OpenClaw routing that makes brain swaps invisible to callers.
If you're at Measure Summit 2026 and you want the architecture conversation — the scorer weights, the structural vs. live tradeoff, what the nine dimensions miss, how the flywheel breaks — come find us. That's what this project is for.
This isn't a solved system. It's a beachhead. The interesting questions we're still working on: