April 2026 · Jordan Hill · 11 min read

GTM AutoResearch

Karpathy's autonomous experimentation loop — applied to Google Tag Manager containers instead of neural nets. Mutate tags, score structurally, keep the winners.

⚡ A Measure Summit 2026 Project · measuresummit.com

Why This Exists

Tracking engineers spend hours manually tuning GTM containers — adjusting consent mode signals, deduplicating events, aligning parameters with Meta and Google Ads. The work is high-leverage but repetitive. What a human does in an afternoon, a small loop can do a hundred times overnight.

GTM AutoResearch is that loop. It borrows the exact structure of Andrej Karpathy's autoresearch project — fixed training budget, a single metric to beat, an agent that mutates one file — and swaps train.py for a GTM container JSON. Claude Haiku proposes a patch, a nine-dimension scorer judges it, and the system keeps what lifts the score and reverts everything else.

This paper is the writeup we're bringing to Measure Summit 2026 — a project about what happens when you treat tracking configuration as an optimization problem an agent can iterate on, and what the measurement community should build on top of it next.


The Loop

Five steps, one metric, one file of truth. Every round is a controlled experiment: measure before, mutate, measure after, keep or revert. No operator intuition about which change "felt" better — the score decides.

STEP 1 — MODIFY CONFIG
Claude Haiku mutates one tag, trigger, or variable mapping. Minimal patch. Never sweeping rewrites.
  │
  ▼
STEP 2 — DEPLOY
Push to the GTM staging workspace (autoresearch-nightly). Never to live. Safety is architectural, not a flag.
  │
  ▼
STEP 3 — MEASURE
Structural scorer evaluates the container JSON. 24-hour signal window available for live comparisons.
  │
  ▼
STEP 4 — KEEP / REVERT
if (after > before) → keep the diff, log "kept"
else → roll back, log "revert"
  │
  ▼
STEP 5 — REPEAT
~30 rounds per run. ~100 experiments over a weekend. ~$0.15 on Claude Haiku per full run.
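
The whole loop compresses to a few lines of TypeScript. Below is a minimal sketch, assuming hypothetical helpers (mutateConfig, deployToStaging, scoreContainer) rather than the project's actual API:

```typescript
type ContainerJson = Record<string, unknown>;

// Assumed helpers, stand-ins for the project's real ones:
declare function mutateConfig(cfg: ContainerJson): Promise<ContainerJson>; // Claude Haiku, one minimal patch
declare function deployToStaging(cfg: ContainerJson): Promise<void>;       // autoresearch-nightly only
declare function scoreContainer(cfg: ContainerJson): number;               // 9-dimension structural score

async function runNightly(initial: ContainerJson, rounds = 30): Promise<ContainerJson> {
  let best = initial;
  let bestScore = scoreContainer(best);

  for (let round = 0; round < rounds; round++) {
    const candidate = await mutateConfig(best);   // STEP 1: one minimal patch
    await deployToStaging(candidate);             // STEP 2: staging only, never live
    const after = scoreContainer(candidate);      // STEP 3: structural measurement

    if (after > bestScore) {                      // STEP 4: keep or revert
      best = candidate;
      bestScore = after;
      console.log(`round ${round}: kept (${after.toFixed(3)})`);
    } else {
      await deployToStaging(best);                // roll back the workspace
      console.log(`round ${round}: revert`);
    }
  }
  return best;                                    // STEP 5 is just running it again
}
```
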
Why the same loop works here: Karpathy's autoresearch optimizes train.py against val_bpb. We optimize a container JSON against a signal quality score. Both get standardized measurement windows, one file to mutate, one number to beat. Replace the scorer and domain, keep the scaffolding.

Karpathy vs. GTM — Side by Side

| Karpathy autoresearch | GTM AutoResearch |
|---|---|
| train.py | container JSON |
| val_bpb (bits per byte) | signal quality score |
| model architecture mutation | tag / trigger / variable mutation |
| fixed 5-min training budget | fixed 5-min training budget |
| validation split | 24-hr signal window |
| program.md → skill file | program.md → SKILL.md |

The 9-Dimension Scorer

The scorer is the heart of the system. It reduces an entire GTM container to a single number by answering nine questions a senior tracking engineer would ask in a code review. Structural evaluation runs in seconds on the JSON alone — fast enough to evaluate a hundred experiments in a weekend.

Each round, the scorer surfaces the lowest-scoring dimension to the mutation prompt as a target. Claude Haiku proposes a minimal patch aimed at that weak spot. If the patch lifts the overall score without regressing another dimension beyond a threshold, it's kept. The agent sees the full nine-tuple after each round — the feedback loop is tight and interpretable, not a black box.
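
As a sketch of that guard, assuming placeholder dimension names and an illustrative regression threshold (the real nine-dimension rubric and its weights are the project's own):

```typescript
// Placeholder dimension names and weights -- illustrative, not the project's rubric.
const WEIGHTS: Record<string, number> = {
  consentModeCoverage: 1.0,
  eventDeduplication: 1.0,
  paramAlignment: 1.0,
  // ...the remaining six dimensions
};

type Scores = Record<string, number>; // each dimension scored 0..1

// Weighted sum reduces the nine-tuple to the single number the loop optimizes.
function overall(s: Scores): number {
  return Object.entries(WEIGHTS).reduce((sum, [dim, w]) => sum + w * (s[dim] ?? 0), 0);
}

// Keep the patch only if the total improves AND no single dimension
// regresses past the threshold.
function shouldKeep(before: Scores, after: Scores, maxRegression = 0.1): boolean {
  const improved = overall(after) > overall(before);
  const regressed = Object.keys(WEIGHTS).some(
    (dim) => (before[dim] ?? 0) - (after[dim] ?? 0) > maxRegression,
  );
  return improved && !regressed;
}
```
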

Why structural, not live: live signal comparisons take days — you need real traffic, real conversion volume, and patience. Structural scoring runs in milliseconds on the JSON. We use live signals for weekly validation, not for the inner loop. This is what unlocks the hundred-experiment weekend.

Morning Deliverables

Run it overnight. By morning, four artifacts are waiting:

| Deliverable | What you get |
|---|---|
| Staging workspace | GTM workspace with the winning config — one-click publish when you're ready |
| Versioned JSON | winning-config.json stored in R2 — roll back to any previous night's best |
| Experiment log | every patch tested, scored, kept or reverted — full audit trail with diffs |
| Playwright QA | each experiment validated in staging preview — tag firing, params, dedup all checked |
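
The dedup check, for instance, can be a short network assertion. A sketch, where the staging URL and the event name are placeholder assumptions, not the project's actual test suite:

```typescript
import { test, expect } from "@playwright/test";

// Count GA4 collect hits for one event to catch duplicate firing.
test("purchase event fires exactly once", async ({ page }) => {
  const hits: string[] = [];
  page.on("request", (req) => {
    const url = req.url();
    // GA4 sends the event name as the `en` query param on /g/collect.
    if (url.includes("/g/collect") && url.includes("en=purchase")) hits.push(url);
  });

  await page.goto("https://staging.example.com/checkout/thank-you"); // placeholder URL
  await page.waitForLoadState("networkidle");

  expect(hits).toHaveLength(1); // dedup check: exactly one hit
});
```
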

Typical Weekend Outcome

Operator contract: the loop never touches production. A human reviews the winner and clicks publish in the GTM UI. The staging workspace is the only place changes can land, and it's hardcoded — not a toggle you could forget to set. Every rejected patch is logged with its regression dimension, so nothing disappears silently.
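
Concretely, "architectural, not a flag" can be as small as a constant in the only deploy path. A sketch, with a hypothetical gtmApi wrapper:

```typescript
// Hypothetical GTM API wrapper; the point is that no parameter exists
// through which a caller could name the live workspace.
declare const gtmApi: {
  updateWorkspace(name: string, config: Record<string, unknown>): Promise<void>;
};

const STAGING_WORKSPACE = "autoresearch-nightly"; // hardcoded, not configurable

async function deploy(config: Record<string, unknown>): Promise<void> {
  await gtmApi.updateWorkspace(STAGING_WORKSPACE, config);
}
```
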

The Six-Phase Fine-Tune Pipeline

The core loop is the beachhead. Beyond it sits a six-phase pipeline that turns each night's winning configs into training data for a client-specialized model. The loop outputs become JSONL; the JSONL becomes a fine-tune; the fine-tune becomes next week's smarter mutator. This is the compounding move — the flywheel.

PHASE 1 — EXPERIMENT LOGGER
Zod schema → SQLite WAL → idempotent INSERT OR IGNORE. One row per round. Never duplicates on rerun.
  │
  ▼
PHASE 2 — ACCOUNT STATE COLLECTOR
MCP tool calls assemble an AccountState blob. GTM containers + Google Ads accounts + Meta (Pipeboard).
  │
  ▼
PHASE 3 — JSONL TRAINING DATA
Score filter (keep high-signal rounds) → Chroma dedup (drop near-duplicate patches) → training.jsonl
  │
  ▼
PHASE 4 — FINE-TUNE RUNNER (dual track)
Track A: OpenAI cloud (gpt-4o-mini / 4o) — best quality
Track B: Local Ollama on M3 Ultra — zero per-token cost
Shared versioned model registry.
  │
  ▼
PHASE 5 — OPENCLAW CLIENT BRAIN
OpenClaw :18789 routes GTM prompts to the fine-tune. Fallback middleware → generalist model if cold or drift-flagged.
  │
  ▼
PHASE 6 — THE FLYWHEEL
Watcher events trigger rebuild checks. Drift detection auto-rolls back below generalist baseline. Every night's winners improve next week's starting prompts.
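
A sketch of Phase 1 under stated assumptions: zod for the row schema, better-sqlite3 for the WAL database, and illustrative column names. The composite primary key plus INSERT OR IGNORE is what makes reruns idempotent:

```typescript
import { z } from "zod";
import Database from "better-sqlite3";

// Illustrative row schema -- field names are assumptions.
const ExperimentRow = z.object({
  runId: z.string(),
  round: z.number().int(),
  patch: z.string(),        // the diff that was tested
  scoreBefore: z.number(),
  scoreAfter: z.number(),
  verdict: z.enum(["kept", "revert"]),
});
type ExperimentRow = z.infer<typeof ExperimentRow>;

const db = new Database("experiments.db");
db.pragma("journal_mode = WAL"); // concurrent reads during the nightly run

db.exec(`CREATE TABLE IF NOT EXISTS experiments (
  run_id TEXT NOT NULL, round INTEGER NOT NULL,
  patch TEXT, score_before REAL, score_after REAL, verdict TEXT,
  PRIMARY KEY (run_id, round)
)`);

// INSERT OR IGNORE + composite primary key = one row per round, ever.
const insert = db.prepare(`INSERT OR IGNORE INTO experiments
  (run_id, round, patch, score_before, score_after, verdict)
  VALUES (@runId, @round, @patch, @scoreBefore, @scoreAfter, @verdict)`);

function logRound(row: ExperimentRow): void {
  insert.run(ExperimentRow.parse(row)); // validate, then write exactly once
}
```
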

Dual Track — Why Both

| | Track A | Track B |
|---|---|---|
| Where | OpenAI cloud | Local Ollama, M3 Ultra |
| Good for | Best quality per token | Privacy + zero cost per token |
| Models | gpt-4o-mini / 4o | llama3 / qwen2.5-coder |

Both tracks publish to a shared model registry with versioned tags — callers don't know which track served them.
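
A sketch of the fallback decision, assuming an illustrative registry interface and drift flag; OpenClaw's actual routing API may differ:

```typescript
// Illustrative registry interface -- model ids and flags are assumptions.
interface BrainRegistry {
  resolve(tag: string): { model: string; healthy: boolean } | undefined;
}

function pickModel(prompt: string, registry: BrainRegistry): string {
  const GENERALIST = "claude-haiku";            // fallback baseline
  if (!isGtmPrompt(prompt)) return GENERALIST;

  const brain = registry.resolve("gtm-latest"); // versioned tag
  // Fall back if the fine-tune is cold or drift-flagged.
  return brain && brain.healthy ? brain.model : GENERALIST;
}

function isGtmPrompt(prompt: string): boolean {
  return /\b(gtm|tag manager|container|trigger)\b/i.test(prompt);
}
```
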

Stack & Economics

TypeScript-first, no build step, minimal dependencies. The whole loop is small enough to read in one sitting — that's a feature, not an accident. Fewer moving parts means fewer things that break at 3am.


Cost Calibration

Why Haiku: the agent's job is small, local mutations against a well-defined schema. It's not writing prose or planning — it's editing JSON one node at a time. Haiku's speed and cost profile mean we can afford the hundred-experiment weekend. Every extra dollar would buy fewer rounds, not better ones.
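
The arithmetic, using the figures quoted in the loop diagram above:

```typescript
// Back-of-envelope from the numbers quoted earlier in this writeup.
const costPerRun = 0.15;      // USD per ~30-round Haiku run
const roundsPerRun = 30;
const weekendExperiments = 100;

const costPerRound = costPerRun / roundsPerRun;        // ≈ $0.005
const weekendCost = costPerRound * weekendExperiments; // ≈ $0.50 for the weekend

console.log({ costPerRound, weekendCost });
```
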

What We're Bringing to Measure Summit 2026

Three things. First, the loop itself — the Karpathy structure adapted for tag managers, with a structural scorer that the measurement community can critique and extend. The nine dimensions were picked from production code reviews; the weights are tunable; every regression is logged. It's ready to inspect.

Second, the scorer as a standard. A tag manager's quality is usually argued in vibes ("this container is clean", "that one's a mess"). Nine structural dimensions give us a shared vocabulary. You don't have to agree with our weights, but to disagree you have to point at a specific dimension and argue about it.

Third, the compounding pipeline. The loop outputs become training data. Training data becomes client-specialized brains. Client brains make next week's experiments smarter. We'll show the Phase 1–6 architecture, the dual-track fine-tune strategy, and the OpenClaw routing that makes brain swaps invisible to callers.

If you're at Measure Summit 2026 and you want the architecture conversation — the scorer weights, the structural vs. live tradeoff, what the nine dimensions miss, how the flywheel breaks — come find us. That's what this project is for.

Open Questions

This isn't a solved system. It's a beachhead. The questions we're still working on include the scorer weights, the structural vs. live tradeoff, what the nine dimensions miss, and how the flywheel breaks.

Coming out of Measure Summit 2026: we want a shared benchmark suite for tag manager structural quality — one a vendor-neutral community can agree on. GTM AutoResearch is our contribution; the scorer is the first draft of that benchmark. Tear it apart and help us make version two.
