OTEL + Claude Code Hooks to KPI Dashboards — how we instrument, monitor, and secure every layer of the AI agent stack.
When autonomous agents make decisions, execute code, and interact with production systems, you need visibility into every layer. Not just "is it running" — but what is it doing, what is it costing, and is anyone gaming the system. Traditional APM doesn't cover this. You need purpose-built observability for AI agent infrastructure.
This paper documents the full observability architecture powering Organized AI's managed agent deployments: Claude Code hooks that capture developer interactions, OpenTelemetry pipelines that route telemetry, KPI dashboards that surface actionable metrics, and FrawdBot security that catches behavioral anomalies before they become incidents.
The observability architecture spans eight distinct layers, each with specific responsibilities. Data flows upward from instrumentation to storage, while security monitoring operates as a cross-cutting concern with access to every layer.
Every developer interaction with Claude Code generates telemetry. Four hook types capture different event classes.
All hooks feed into .claude/hooks/otel-emit.js, which converts JSON stdin into proper OpenTelemetry spans using @opentelemetry/sdk-node. Spans ship to the OTEL Collector via OTLP/HTTP on port 4318.
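To make the hook-to-span conversion concrete, here is a minimal sketch that wraps one hook event in an OTLP/JSON trace payload and prepares the HTTP request. It is written in Python with the standard library only, not the Node SDK the real otel-emit.js uses. The hook field names (`hook_event_name`, `session_id`, `tool_name`), the `hook.*` attribute keys, the service name, and the `localhost` host are assumptions for illustration; only port 4318 comes from the text above.

```python
import json
import secrets
import time
import urllib.request

# Assumed collector address: port 4318 is from the architecture, the host is a guess.
OTLP_TRACES_URL = "http://localhost:4318/v1/traces"


def hook_event_to_otlp(event: dict, service_name: str = "claude-code-hooks") -> dict:
    """Wrap one hook event (parsed from JSON stdin) in an OTLP/JSON export body."""
    now_ns = time.time_ns()
    event_name = event.get("hook_event_name", "unknown")
    candidates = (
        ("hook.event_name", event_name),
        ("hook.session_id", event.get("session_id", "")),
        ("hook.tool_name", event.get("tool_name", "")),
    )
    # Drop empty values so spans only carry attributes the hook actually sent.
    attributes = [
        {"key": key, "value": {"stringValue": str(value)}}
        for key, value in candidates
        if value
    ]
    span = {
        "traceId": secrets.token_hex(16),  # 16 random bytes, 32 hex characters
        "spanId": secrets.token_hex(8),    # 8 random bytes, 16 hex characters
        "name": f"claude_code.{event_name}",
        "kind": 1,  # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(now_ns),
        "endTimeUnixNano": str(now_ns),
        "attributes": attributes,
    }
    resource = {
        "attributes": [{"key": "service.name", "value": {"stringValue": service_name}}]
    }
    scope_spans = [{"scope": {"name": "otel-emit-sketch"}, "spans": [span]}]
    return {"resourceSpans": [{"resource": resource, "scopeSpans": scope_spans}]}


def build_request(payload: dict, url: str = OTLP_TRACES_URL) -> urllib.request.Request:
    """Prepare (but do not send) the OTLP/HTTP POST for a payload."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A hook script following this shape would read `json.load(sys.stdin)`, call `hook_event_to_otlp`, and pass the prepared request to `urllib.request.urlopen`. A production emitter would more likely rely on an OpenTelemetry SDK for batching, retries, and context propagation.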
The central nervous system. Receives telemetry from both Claude Code Hooks and OpenClaw's diagnostics-otel plugin. A batch processor (5-second timeout, 100-span batches) smooths traffic before fan-out to downstream exporters.
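The batching behavior described above (flush at 100 spans or after 5 seconds, whichever comes first) can be sketched in a few lines. This is an illustrative model, not the Collector's own batch processor: the class name, the `tick` method, and the injectable clock are hypothetical, and the timeout here is measured from the first buffered span as a simplification.

```python
import time
from typing import Any, Callable, List, Optional


class SpanBatcher:
    """Buffer spans and hand them to `export` by size or by age."""

    def __init__(
        self,
        export: Callable[[List[Any]], None],
        max_batch_size: int = 100,   # 100-span batches, per the text
        timeout_s: float = 5.0,      # 5-second timeout, per the text
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._export = export
        self._max_batch_size = max_batch_size
        self._timeout_s = timeout_s
        self._clock = clock
        self._buffer: List[Any] = []
        self._first_added_at: Optional[float] = None

    def add(self, span: Any) -> None:
        """Buffer one span; flush immediately when the batch is full."""
        if not self._buffer:
            self._first_added_at = self._clock()
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def tick(self) -> None:
        """Call periodically; flushes once the oldest span has waited long enough."""
        if self._buffer and self._clock() - self._first_added_at >= self._timeout_s:
            self.flush()

    def flush(self) -> None:
        """Export whatever is buffered and reset the timer."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._first_added_at = None
            self._export(batch)
```

Passing the clock in as a parameter keeps the timeout path testable without sleeping. In a real Collector the same behavior is configured on the batch processor and not coded by hand.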
Two tiers of export: Tier 1 sends infrastructure metrics to Prometheus and traces to Tempo. Tier 2 routes LLM-specific data to Langfuse and product analytics to PostHog. This separation keeps infrastructure SRE dashboards fast while giving ML teams their own observability plane.
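The two-tier fan-out amounts to a routing table from telemetry category to exporter. The sketch below groups a batch by destination; the category keys (`infra_metrics`, `traces`, `llm`, `product_analytics`) and the `category` field on each record are hypothetical labels chosen for illustration, while the destinations follow the tiers described above.

```python
from typing import Dict, List, Tuple

# category -> (tier, exporter); destinations follow the two tiers in the text.
EXPORT_ROUTES: Dict[str, Tuple[str, str]] = {
    "infra_metrics": ("tier1", "prometheus"),
    "traces": ("tier1", "tempo"),
    "llm": ("tier2", "langfuse"),
    "product_analytics": ("tier2", "posthog"),
}


def fan_out(batch: List[dict]) -> Dict[str, List[dict]]:
    """Group records by the exporter that should receive them."""
    grouped: Dict[str, List[dict]] = {}
    for record in batch:
        category = record["category"]
        if category not in EXPORT_ROUTES:
            raise ValueError(f"no export route for category {category!r}")
        _tier, exporter = EXPORT_ROUTES[category]
        grouped.setdefault(exporter, []).append(record)
    return grouped
```

Rejecting unknown categories makes a misrouted signal fail loudly where it would otherwise vanish silently. A real pipeline might prefer a default exporter.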
OpenClaw Gateway on port 3000 handles inbound messages from six channels: WhatsApp, Telegram, Discord, Web API, iMessage, and Signal. ClawRouter scores every request across 14 dimensions in under 1 ms and routes it to the optimal model tier.
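The source does not list ClawRouter's 14 dimensions or its scoring formula, so the following is only a sketch of one plausible shape: a weighted average of per-dimension scores in [0, 1], mapped to a tier by thresholds. The thresholds, the `is_code` shortcut, and any dimension names a caller supplies are assumptions; the tier names are the four Ollama tiers this section goes on to describe.

```python
from typing import Mapping

# Hypothetical cut-offs on the weighted score; the real values are not published.
TIER_THRESHOLDS = [(0.33, "SIMPLE"), (0.66, "MEDIUM")]


def route(
    scores: Mapping[str, float],
    weights: Mapping[str, float],
    is_code: bool = False,
) -> str:
    """Pick a model tier from per-dimension scores in [0, 1]."""
    if is_code:
        # Assumed shortcut: code requests go straight to the specialized tier.
        return "CODE"
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Dimensions missing from `scores` count as 0.
    weighted = sum(w * scores.get(dim, 0.0) for dim, w in weights.items()) / total_weight
    for upper_bound, tier in TIER_THRESHOLDS:
        if weighted < upper_bound:
            return tier
    return "COMPLEX"
```

A weighted sum over a fixed set of dimensions is cheap to evaluate, which would fit a sub-millisecond budget, but the actual router may use a different method.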
Local inference through Ollama serves four model tiers: SIMPLE (3B parameters), MEDIUM (8B), COMPLEX (70B), and CODE (7B specialized). The diagnostics-otel plugin emits GenAI semantic convention spans for every inference call — model, tokens in/out, latency, and cost.
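As an illustration of what one inference span might carry, the sketch below builds an attribute dictionary. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as we understand them. The `"chat"` operation value, the `llm.latency_ms` and `llm.cost_usd` keys, and the function itself are assumptions, since the plugin's actual attribute set is not shown here.

```python
from typing import Dict, Union

AttributeValue = Union[str, int, float]


def genai_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    latency_ms: float,
    cost_usd: float,
) -> Dict[str, AttributeValue]:
    """Build span attributes for one inference call."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return {
        "gen_ai.operation.name": "chat",  # assumed operation type
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical custom keys for the latency and cost fields in the text.
        "llm.latency_ms": latency_ms,
        "llm.cost_usd": cost_usd,
    }
```

For local Ollama inference `cost_usd` could be an amortized hardware estimate or simply zero; the source does not say how cost is computed.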
The economic engine. Three caching strategies combine to eliminate redundant inference costs:
all-MiniLM-L6-v2 embeddings at a 0.92 similarity threshold. Similar-enough prompts hit the cache.
Combined hit rate target: 30-50%. Every cache hit costs $0, which is pure margin.
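The semantic lookup can be sketched as a nearest-neighbor search with a cut-off. This example assumes the 0.92 threshold is a cosine similarity, takes embeddings as plain lists of floats (computing them with all-MiniLM-L6-v2 is out of scope), and uses an in-memory linear scan where the real deployment uses a Redis-backed index. The class and method names are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a cached response when a prompt embedding is close enough."""

    def __init__(self, threshold: float = 0.92) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, embedding: Sequence[float], response: str) -> None:
        self._entries.append((list(embedding), response))

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        """Best match at or above the threshold, else None (a cache miss)."""
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

A miss returns `None`, at which point the caller would run inference and `put` the new embedding and response. The linear scan is fine for a sketch but would need a vector index at production volume.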
Grafana dashboards on port 3001 surface: Cost Savings, Token Burn Rate, Routing Tier Distribution, Task Completion Rate, Latency p95, SLA Uptime, Cache Hit Rate, and Float Balance.
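Two of these KPIs have simple definitions that are worth writing down. The sketch below computes cache hit rate from hit and miss counters and a p95 latency using the nearest-rank method; the function names are illustrative, and a Prometheus-backed dashboard would normally derive p95 from histogram buckets instead of raw samples.

```python
from typing import Sequence


def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache; 0.0 when there were no lookups."""
    total = hits + misses
    return hits / total if total else 0.0


def p95(latencies_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("p95 of an empty sample is undefined")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) in integer arithmetic to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With 40 hits and 60 misses the hit rate is 0.4, which sits inside the 30-50% target stated earlier.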
| Store | Databases | Purpose |
|---|---|---|
| PostgreSQL :5432 | langfuse_db, posthog_db, grafana_db, frawdbot_db, token_float_db | Transactional data, audit trails, billing ledger |
| ClickHouse :8123 | Metrics, Events, Traces, Fraud Analytics, Cache Analytics | High-cardinality OLAP, materialized P&L views |
| S3 Cold Storage | Parquet, DB Backups, Data Lake, Forensic Archive, Billing Archive | Long-term retention, compliance, ML training data |
Data lifecycle: hot (PostgreSQL, real-time) to warm (ClickHouse, analytical) to cold (S3 Parquet, archived). Daily partitioning with Snappy compression. Nightly pg_dump. Weekly ClickHouse backups to S3. Forensic archives are immutable with chain-of-custody logging.
Five detection modules run continuously against all agent activity.
Ingestion spans four sources: Langfuse API traces, direct PostgreSQL reads, Prometheus PromQL baselines, and AlertManager webhooks. Response actions: kill session (Gateway API), adjust trust score (ClawRouter), push alerts (AlertManager), write to forensic storage.
Fleet configuration management for distributed Mac hardware. A Git repository serves as the single source of truth — configs, skill manifests, fleet definitions, and encrypted secrets all version-controlled.
Every Mac runs clawherd-agent, a pull-based daemon syncing every 5 minutes via launchd. The agent resolves its role from inventory.yaml, diffs desired state against actual state, and applies only what changed. Idempotent. No push. No SSH.
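The diff-and-apply loop is what makes the agent idempotent: once actual state matches desired state, the next diff is empty and nothing is touched. Here is a minimal sketch of that idea over flat key-value state. The function names and the flat dictionary shape are assumptions; the real agent's state model and inventory.yaml schema are not shown in this paper.

```python
from typing import Any, Dict


def diff_state(desired: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the changes needed to move `actual` to `desired`."""
    to_set = {
        key: value
        for key, value in desired.items()
        if key not in actual or actual[key] != value
    }
    to_remove = sorted(key for key in actual if key not in desired)
    return {"set": to_set, "remove": to_remove}


def apply_changes(actual: Dict[str, Any], changes: Dict[str, Any]) -> Dict[str, Any]:
    """Apply a change set and return the new state; the input is not mutated."""
    new_state = {k: v for k, v in actual.items() if k not in changes["remove"]}
    new_state.update(changes["set"])
    return new_state
```

Running `diff_state` again after `apply_changes` yields `{"set": {}, "remove": []}`, so a sync that fires every 5 minutes is a no-op on an already-converged machine.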
A registry.yaml marketplace catalog defines skill bundles: core (free), marketing, sales, product, data, gtm, dev. Per-client resolution combines tier minimums with vertical selection and addon toggles. Skills sync via rsync to ~/.openclaw/skills/ with hot-reload on the Gateway. Semver versioning with canary rollouts and automatic rollback on health check failure.
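Per-client resolution can be sketched as a union of three sources with duplicates removed. The bundle names below come from the catalog above, but the tier names (`starter`, `growth`), the vertical names (`saas`, `agency`), and their mappings are hypothetical, since the registry.yaml schema is not reproduced here.

```python
from typing import Dict, List

KNOWN_BUNDLES = {"core", "marketing", "sales", "product", "data", "gtm", "dev"}

# Hypothetical mappings for illustration only.
TIER_MINIMUMS: Dict[str, List[str]] = {
    "starter": ["core"],
    "growth": ["core", "marketing"],
}
VERTICAL_BUNDLES: Dict[str, List[str]] = {
    "saas": ["product", "dev"],
    "agency": ["marketing", "gtm"],
}


def resolve_bundles(tier: str, vertical: str, addons: Dict[str, bool]) -> List[str]:
    """Combine tier minimums, vertical selection, and enabled addons."""
    if tier not in TIER_MINIMUMS:
        raise ValueError(f"unknown tier {tier!r}")
    enabled_addons = [name for name, enabled in addons.items() if enabled]
    ordered = TIER_MINIMUMS[tier] + VERTICAL_BUNDLES.get(vertical, []) + enabled_addons
    unknown = set(ordered) - KNOWN_BUNDLES
    if unknown:
        raise ValueError(f"unknown bundles: {sorted(unknown)}")
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(ordered))
```

For example, `resolve_bundles("growth", "agency", {"data": True, "sales": False})` yields `["core", "marketing", "gtm", "data"]`: `marketing` appears in both the tier and the vertical but is listed once, and the disabled `sales` addon is skipped.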
SOPS + age encryption at rest in Git. The age private key lives in macOS Keychain. Decrypted only on the target Mac at apply time. Never logged. Never in OTEL spans.
| Port | Service | Layer |
|---|---|---|
| :3000 | OpenClaw Gateway | Agent Runtime |
| :3001 | Grafana Dashboards | Visualization |
| :3100 | Langfuse | LLM Observability |
| :3200 | Grafana Tempo | Trace Storage |
| :4000 | FrawdBot Engine | Security |
| :4001 | FrawdBot Dashboard | Security UI |
| :4317 | OTEL Collector gRPC | Internal |
| :4318 | OTEL Collector HTTP | Primary Ingress |
| :5050 | Token Broker | Provider Arbitrage |
| :5432 | PostgreSQL | Data Persistence |
| :6379 | Redis | Semantic Cache |
| :8000 | Coolify Dashboard | Deployment PaaS |
| :8123 | ClickHouse | OLAP Analytics |
| :9090 | Prometheus | Metrics Storage |
| :9093 | AlertManager | Alert Routing |
| :11434 | Ollama | Local Inference |
Seven metric namespaces cover the full stack: