Karpathy's autonomous experimentation loop — applied to Google Tag Manager containers instead of neural nets. Mutate tags, score structurally, keep the winners.
Tracking engineers spend hours manually tuning GTM containers — adjusting consent mode signals, deduplicating events, aligning parameters with Meta and Google Ads. The work is high-leverage but repetitive. What a human does in an afternoon, a small loop can do a hundred times overnight.
GTM AutoResearch is that loop. It borrows the exact structure of Andrej Karpathy's autoresearch project — fixed training budget, a single metric to beat, an agent that mutates one file — and swaps train.py for a GTM container JSON. Claude Haiku proposes a patch, a nine-dimension scorer judges it, and the system keeps what lifts the score and reverts everything else.
This paper is what we're bringing to Measure Summit 2026: a project about what happens when you treat tracking configuration as an optimization problem an agent can iterate on, and about what the measurement community should build on top of it next.
Five steps, one metric, one file of truth. Every round is a controlled experiment: measure before, mutate, measure after, keep or revert. No operator intuition about which change "felt" better — the score decides.
Karpathy optimizes train.py against val_bpb; we optimize a container JSON against a signal quality score. Both get standardized measurement windows, one file to mutate, one number to beat. Replace the scorer and domain, keep the scaffolding.
| Karpathy autoresearch | GTM AutoResearch |
|---|---|
| train.py | container JSON |
| val_bpb (bits per byte) | signal quality score |
| model architecture mutation | tag / trigger / variable mutation |
| fixed 5-min training budget | fixed 5-min experiment budget |
| validation split | 24-hr signal window |
| program.md → skill file | program.md → SKILL.md |
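
Concretely, one round of the loop fits in a screenful. A minimal sketch, with `scoreContainer` and `proposePatch` as stand-ins for the real scorer and the Claude Haiku call; the names and shapes here are illustrative, not the project's actual API:

```ts
import { readFileSync, writeFileSync } from "node:fs";

type Container = Record<string, unknown>;

// One round: measure before, mutate, measure after, keep or revert.
async function runRound(
  path: string,
  scoreContainer: (c: Container) => number,
  proposePatch: (c: Container) => Promise<Container>,
): Promise<boolean> {
  const before = JSON.parse(readFileSync(path, "utf8")) as Container;
  const baseline = scoreContainer(before);      // measure before

  const candidate = await proposePatch(before); // mutate (LLM patch)
  const after = scoreContainer(candidate);      // measure after

  if (after > baseline) {
    writeFileSync(path, JSON.stringify(candidate, null, 2)); // keep the winner
    return true;
  }
  return false; // revert is free: the file on disk was never touched
}
```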
The scorer is the heart of the system. It reduces an entire GTM container to a single number by answering nine questions a senior tracking engineer would ask in a code review. Structural evaluation runs in seconds on the JSON alone — fast enough to evaluate a hundred experiments in a weekend.
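What "structural" means in practice: each dimension can be a pure function over the parsed export JSON, weighted into a single number. A sketch with hypothetical dimension names and weights, not the actual nine:

```ts
// Hypothetical scorer shape: nine pure checks over the container JSON,
// each scored 0..1, combined by tunable weights. Names are placeholders.
type Container = { tag?: { paused?: boolean }[] }; // GTM export shape, trimmed

type Check = (c: Container) => number;

const dimensions: Record<string, { weight: number; check: Check }> = {
  // Example check: penalize tags left paused in the exported container.
  pausedTags: {
    weight: 1,
    check: (c) => {
      const tags = c.tag ?? [];
      return tags.length === 0 ? 1 : tags.filter((t) => !t.paused).length / tags.length;
    },
  },
  // ...eight more checks (consent coverage, event dedup, param alignment, ...)
};

export function score(c: Container): { byDim: Record<string, number>; overall: number } {
  const byDim: Record<string, number> = {};
  let sum = 0;
  let norm = 0;
  for (const [name, { weight, check }] of Object.entries(dimensions)) {
    byDim[name] = check(c);
    sum += weight * byDim[name];
    norm += weight;
  }
  return { byDim, overall: sum / norm }; // the single number to beat
}
```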
Each round, the scorer surfaces the lowest-scoring dimension to the mutation prompt as a target. Claude Haiku proposes a minimal patch aimed at that weak spot. If the patch lifts the overall score without regressing another dimension beyond a threshold, it's kept. The agent sees the full nine-tuple after each round — the feedback loop is tight and interpretable, not a black box.
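A sketch of that acceptance rule and the target picker, with the regression threshold as an assumed tunable:

```ts
type Scored = { byDim: Record<string, number>; overall: number };

// Next round's mutation target: the weakest dimension in the nine-tuple.
const weakest = (s: Scored): string =>
  Object.entries(s.byDim).sort((a, b) => a[1] - b[1])[0][0];

// Keep a patch only if overall improves AND no single dimension regresses
// past the threshold. EPSILON here is an illustrative value, not the real one.
const EPSILON = 0.05;

function accept(before: Scored, after: Scored): boolean {
  if (after.overall <= before.overall) return false;
  return Object.keys(before.byDim).every(
    (dim) => before.byDim[dim] - (after.byDim[dim] ?? 0) <= EPSILON,
  );
}
```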
Run it overnight. By morning, four artifacts are waiting:
| Deliverable | What you get |
|---|---|
| Staging workspace | GTM workspace with winning config — one-click publish when you're ready |
| Versioned JSON | winning-config.json stored in R2 — rollback to any previous night's best |
| Experiment log | every patch tested, scored, kept or reverted — full audit trail with diffs |
| Playwright QA | each experiment validated in staging preview — tag firing, params, dedup all checked |
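
The Playwright check can be as small as watching the network while the staging page loads. A sketch under assumptions: the URL is hypothetical, the real harness drives GTM's own preview mode, and the dedup assertion looks for exactly one GA4 purchase hit carrying its transaction_id:

```ts
import { chromium } from "playwright";

// Hypothetical staging URL; substitute the page your container runs on.
const STAGING_URL = process.env.STAGING_URL ?? "https://staging.example.com/checkout";

async function qa(): Promise<void> {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Record every GA4 hit the page sends while it loads.
  const hits: URL[] = [];
  page.on("request", (req) => {
    if (req.url().includes("google-analytics.com/g/collect")) hits.push(new URL(req.url()));
  });

  await page.goto(STAGING_URL, { waitUntil: "networkidle" });

  // Dedup check: exactly one purchase event, carrying its transaction_id.
  const purchases = hits.filter((u) => u.searchParams.get("en") === "purchase");
  if (purchases.length !== 1) throw new Error(`expected 1 purchase hit, got ${purchases.length}`);
  if (!purchases[0].searchParams.get("ep.transaction_id")) throw new Error("purchase hit missing transaction_id");

  await browser.close();
}

qa().catch((err) => { console.error(err); process.exit(1); });
```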
The core loop is the beachhead. Beyond it sits a six-phase pipeline that turns each night's winning configs into training data for a client-specialized model. The loop outputs become JSONL; the JSONL becomes a fine-tune; the fine-tune becomes next week's smarter mutator. This is the compounding move — the flywheel.
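A sketch of that data step, assuming hypothetical field names for the log entries; the output rows use OpenAI's chat fine-tuning JSONL shape:

```ts
import { appendFileSync } from "node:fs";

// Field names here are assumptions about what the experiment log holds.
interface LogEntry {
  weakestDimension: string; // what the mutator was asked to fix
  containerBefore: unknown; // the state the patch was proposed against
  patch: unknown;           // the mutation that was applied
  kept: boolean;            // only winners become training data
}

function toJsonl(entries: LogEntry[], outPath: string): void {
  for (const e of entries) {
    if (!e.kept) continue; // reverted patches stay in the audit trail, not the curriculum
    const row = {
      messages: [
        { role: "system", content: "You improve GTM containers. Output a minimal JSON patch." },
        { role: "user", content: `Weakest dimension: ${e.weakestDimension}\n${JSON.stringify(e.containerBefore)}` },
        { role: "assistant", content: JSON.stringify(e.patch) },
      ],
    };
    appendFileSync(outPath, JSON.stringify(row) + "\n");
  }
}
```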
| | Track A | Track B |
|---|---|---|
| Where | OpenAI cloud | Local Ollama, M3 Ultra |
| Good for | Best quality per token | Privacy + zero cost per token |
| Models | gpt-4o-mini / 4o | llama3 / qwen2.5-coder |

Both tracks publish to a shared model registry with versioned tags, so callers never know which track served them.
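
From the caller's side, that can look like the hypothetical resolver below (OpenClaw's actual interface may differ): callers ask for a versioned tag, and the registry decides which track answers.

```ts
// Hypothetical registry resolver: a versioned tag maps to whichever track
// currently serves that model, so a brain swap is a registry update.
type Track = "openai" | "ollama";
interface Entry { track: Track; model: string; version: string }

const registry = new Map<string, Entry>([
  ["mutator@prod",   { track: "ollama", model: "qwen2.5-coder", version: "v3" }],
  ["mutator@canary", { track: "openai", model: "gpt-4o-mini",   version: "v4" }],
]);

async function complete(tag: string, prompt: string): Promise<string> {
  const e = registry.get(tag);
  if (!e) throw new Error(`unknown model tag: ${tag}`);
  if (e.track === "ollama") {
    // Local track: Ollama's generate endpoint on the default port.
    const res = await fetch("http://localhost:11434/api/generate", {
      method: "POST",
      body: JSON.stringify({ model: e.model, prompt, stream: false }),
    });
    return ((await res.json()) as { response: string }).response;
  }
  // Cloud track: OpenAI chat completions.
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: JSON.stringify({ model: e.model, messages: [{ role: "user", content: prompt }] }),
  });
  return ((await res.json()) as { choices: { message: { content: string } }[] }).choices[0].message.content;
}
```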
TypeScript-first, no build step, minimal dependencies. The whole loop is small enough to read in one sitting — that's a feature, not an accident. Fewer moving parts means fewer things that break at 3am.
tsx: no compile step, .ts files run directly.

Three things. First, the loop itself: the Karpathy structure adapted for tag managers, with a structural scorer that the measurement community can critique and extend. The nine dimensions were picked from production code reviews; the weights are tunable; every regression is logged. It's ready to inspect.
Second, the scorer as a standard. A container's quality is usually argued in vibes ("this container is clean", "that one's a mess"). Nine structural dimensions give us a shared vocabulary. You don't have to agree with our weights; you just have to point at a specific dimension and argue about it.
Third, the compounding pipeline. The loop outputs become training data. Training data becomes client-specialized brains. Client brains make next week's experiments smarter. We'll show the Phase 1–6 architecture, the dual-track fine-tune strategy, and the OpenClaw routing that makes brain swaps invisible to callers.
If you're at Measure Summit 2026 and you want the architecture conversation — the scorer weights, the structural vs. live tradeoff, what the nine dimensions miss, how the flywheel breaks — come find us. That's what this project is for.
This isn't a solved system. It's a beachhead. The interesting questions we're still working on: