March 2026 · Jordaaan Hill & Colin McNamara · 8 min read

The Infrastructure Playbook

Building Managed AI Agent Services from First Principles

The gap between a working agent and a production agent is entirely infrastructure. This paper identifies the architectural patterns, observability requirements, security considerations, and compliance frameworks that managed AI agent services must address — derived from production deployments.

Architecture · SOC 2 · SRE · Observability · Security

1. The Infrastructure Gap

AI agents are production-ready. The infrastructure to run them is not.

Over 40,000 agent platforms exist today. The framework landscape is mature and intensely competitive. Any team with a weekend and an API key can build an agent that does something useful.

But "something useful" is not "production." Production means monitoring at 3 AM. Production means SOC 2 auditors asking for architecture diagrams. Production means API keys that rotate before they expire. Production means a caching layer that handles 10x traffic spikes during quarterly reporting. Production means someone answering the phone when the agent starts sending emails to the wrong people.

The gap between a working agent and a production agent is entirely infrastructure. This paper documents the infrastructure patterns that close that gap.

2. Architecture Patterns

Managed agent infrastructure must address six architectural concerns that no framework handles natively. Each concern maps to a distinct infrastructure layer.

[Agent Framework Layer]
  Any framework: the customer's choice
        |
        v
[Routing + Policy Layer]
  Authentication, authorization, provider selection
        |
        v
[Observability Layer]          [Durability Layer]
  Real-time dashboards           Long-term storage
  Sub-second aggregation         Compliance records
        |                              |
        +---------- combined ----------+
                       |
                       v
              [Security Layer]
        Behavioral monitoring + alerting

The key design principle is separation of concerns. The agent framework handles reasoning and tool execution. The infrastructure handles everything else: routing, caching, monitoring, security, and compliance. This separation means you can swap frameworks without rebuilding infrastructure, and upgrade infrastructure without touching the agent.

The Dual-Database Pattern

The most counterintuitive architecture decision in production agent infrastructure is running two database layers in parallel. It sounds redundant. It is essential.

Agent workloads produce two fundamentally different query patterns. Operators need real-time dashboards that aggregate millions of log entries across dozens of agents, grouped by status and response time percentile, refreshing in milliseconds. Compliance teams need durable transactional records that survive hardware failures and satisfy auditors.

No single database architecture optimizes for both. The solution is a columnar query acceleration layer for real-time analytics sitting alongside a row-oriented system of record for durability. All agent events flow into both simultaneously.
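The dual-write flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: SQLite stands in for the row-oriented system of record, plain Python lists stand in for the columnar layer, and the class name DualWriteSink and the event field names are assumptions made for the example.

```python
import json
import sqlite3
from collections import defaultdict


class DualWriteSink:
    """Fan each agent event out to a row store and a columnar buffer."""

    def __init__(self) -> None:
        # Row-oriented system of record (SQLite stands in for a durable RDBMS).
        self.rows = sqlite3.connect(":memory:")
        self.rows.execute(
            "CREATE TABLE events (id INTEGER PRIMARY KEY, agent TEXT, payload TEXT)"
        )
        # Columnar stand-in: one Python list per column.
        self.columns: dict[str, list] = defaultdict(list)

    def write(self, event: dict) -> None:
        # 1. Durable record: the full event, committed transactionally.
        with self.rows:
            self.rows.execute(
                "INSERT INTO events (agent, payload) VALUES (?, ?)",
                (event["agent"], json.dumps(event)),
            )
        # 2. Analytics copy: only the columns that dashboards aggregate on.
        for name in ("timestamp", "agent", "status", "latency_ms"):
            self.columns[name].append(event[name])


sink = DualWriteSink()
sink.write({"timestamp": 1.0, "agent": "billing", "status": "ok",
            "latency_ms": 120, "body": "..."})
sink.write({"timestamp": 2.0, "agent": "billing", "status": "error",
            "latency_ms": 900, "body": "..."})

row_count = sink.rows.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(row_count, sink.columns["latency_ms"])
```

The row store keeps the complete payload for auditors, while the analytics copy carries only the narrow columns that dashboards read. A real deployment would also need to decide what happens when one of the two writes fails, which this sketch does not address.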

10x Query Acceleration

<200ms Dashboard Refresh

10:1 Log Compression

Key Principle

Columnar storage compresses agent log data at roughly 10:1 ratios. Aggregation queries that scan only the columns needed (timestamp, status, latency) skip the columns not needed (request body, response body, headers), reducing I/O by orders of magnitude. Do not choose one database or the other. Use both. The replication cost is negligible compared to the operational benefit.
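To make the column-scan argument concrete, here is a small sketch of a dashboard-style aggregation over a column-oriented layout. The column names and sample values are invented for illustration, and the nearest-rank percentile is one simple choice among several. The point is that the query touches only status and latency_ms and never reads the wide body columns.

```python
import math
from collections import defaultdict

# Column-oriented layout: one list per field, aligned by index.
# The wide columns (request_body, response_body) are never read below.
log = {
    "status": ["ok", "ok", "error", "ok", "error", "ok"],
    "latency_ms": [100, 140, 900, 120, 1500, 110],
    "request_body": ["<large>"] * 6,
    "response_body": ["<large>"] * 6,
}


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_status(columns: dict[str, list], pct: float) -> dict[str, float]:
    """Group latency by status, scanning only the two columns required."""
    groups: dict[str, list[float]] = defaultdict(list)
    for status, latency in zip(columns["status"], columns["latency_ms"]):
        groups[status].append(latency)
    return {status: percentile(vals, pct) for status, vals in groups.items()}


p95 = latency_by_status(log, 95)
print(p95)
```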

3. The Observability Challenge

Agent observability is fundamentally different from traditional application monitoring. Agents make autonomous decisions, call tools in unpredictable sequences, and interact with external systems in ways that depend on conversational context. Standard APM tools were not designed for this.

Three-Domain Monitoring

Production agent observability requires monitoring across three domains simultaneously:

Utilization: Resource consumption across all deployment hardware. Token consumption by model provider. Queue depths for pending tasks. These metrics identify capacity constraints before they cause failures.

Faults: Error rates by agent, tool, and API endpoint. Timeout frequency. Provider availability. Security alert severity. These metrics identify active problems.

Operations: Deployment history, configuration changes, credential rotation status, patch levels. These metrics track infrastructure health distinct from the agents running on it.
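One way to keep the three domains distinct in practice is to tag every metric with its domain and report breaches per domain. The sketch below assumes a simple "value above threshold" alert rule; the metric names and thresholds are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    UTILIZATION = "utilization"
    FAULTS = "faults"
    OPERATIONS = "operations"


@dataclass(frozen=True)
class Metric:
    name: str
    domain: Domain
    value: float
    threshold: float  # hypothetical alert threshold

    @property
    def breached(self) -> bool:
        return self.value > self.threshold


def breaches_by_domain(metrics: list[Metric]) -> dict[Domain, list[str]]:
    """Return the names of breached metrics, grouped by monitoring domain."""
    report: dict[Domain, list[str]] = {domain: [] for domain in Domain}
    for metric in metrics:
        if metric.breached:
            report[metric.domain].append(metric.name)
    return report


sample = [
    Metric("queue_depth", Domain.UTILIZATION, value=480, threshold=500),
    Metric("tool_error_rate", Domain.FAULTS, value=0.07, threshold=0.05),
    Metric("days_since_key_rotation", Domain.OPERATIONS, value=95, threshold=90),
]
report = breaches_by_domain(sample)
print(report)
```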

Request-Level Tracing

Every agent interaction — from initial prompt to final response — must be captured as a trace with spans for each tool call, model invocation, and processing step. This trace data serves dual purposes: real-time latency analysis for operators and compliance artifacts for auditors.
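A trace with nested spans can be captured with a small context manager. This is a sketch using only the standard library; the Trace and Span classes and the span kinds are assumptions for illustration. A production system would more likely emit spans through an established tracing library than hand-roll this.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    name: str
    kind: str  # e.g. "model", "tool", "step"
    parent: Optional[str]
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start: float = 0.0
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000


class Trace:
    """Collects spans for one agent interaction, prompt to final response."""

    def __init__(self) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, kind: str) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(name=name, kind=kind, parent=parent)
        current.start = time.perf_counter()
        self._stack.append(current)
        try:
            yield current
        finally:
            # Record the span even if the wrapped step raised.
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)


trace = Trace()
with trace.span("handle_prompt", "step") as root:
    with trace.span("plan", "model"):
        pass
    with trace.span("lookup_invoice", "tool"):
        pass
```

Spans are appended when they close, so children appear in the list before their parent. The parent field preserves the hierarchy for later reconstruction.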

Insight

Auto-generated architecture diagrams from production traces are not just documentation — they are audit artifacts. SOC 2 reviewers need data flow diagrams. Due diligence teams need system architecture. Generating these from live traces satisfies both audiences with zero manual documentation effort.
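As a rough illustration of generating a data flow diagram from traces, the sketch below turns observed caller and callee pairs into Graphviz DOT text. The component names and the flattened record shape are hypothetical; a real pipeline would derive these pairs from span parent and child relationships.

```python
from collections import Counter

# Hypothetical flattened span records: (caller component, callee component).
observed_calls = [
    ("support-agent", "routing-layer"),
    ("routing-layer", "model-provider-a"),
    ("support-agent", "routing-layer"),
    ("routing-layer", "crm-tool"),
]


def to_dot(calls: list[tuple[str, str]]) -> str:
    """Render observed caller -> callee pairs as a Graphviz DOT digraph."""
    counts = Counter(calls)
    lines = ["digraph dataflow {"]
    for (src, dst), n in sorted(counts.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{n} calls"];')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(observed_calls)
print(dot)
```

Because the diagram is derived from what actually ran, it stays current as the deployment changes. It can only show flows that appear in the sampled traces, so rarely used paths may be missing.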

4. Intelligent Routing

The routing layer is the control plane of managed agent infrastructure. Every request from every agent passes through it, making it the natural enforcement point for security, cost management, and observability.

Multi-Stage Request Processing

Each incoming request passes through a pipeline:

Authentication: Verify the request comes from a registered agent on authorized hardware.
Policy enforcement: Confirm the action is permitted by configuration. A marketing agent cannot access financial data. An HR agent cannot modify production systems.
Provider selection: Route to the optimal model provider based on task complexity, cost, and availability. Simple tasks go to efficient models. Complex reasoning goes to frontier models.
Content filtering: Scan for prompt injection attempts before the request reaches the model.
Response capture: Log the complete request-response pair for monitoring and compliance.
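The stages above can be sketched as an ordered list of functions that each either pass the request along or reject it. Everything named here (the agent IDs, the allowed actions, the injection marker, and the prompt-length heuristic for provider selection) is a placeholder assumption. The model call itself is omitted, and an in-memory audit log stands in for response capture.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    agent_id: str
    action: str
    prompt: str
    provider: str = ""
    audit_log: list[str] = field(default_factory=list)


class Rejected(Exception):
    """Raised when a stage refuses the request."""


# Hypothetical configuration; a real system would load this from policy storage.
REGISTERED_AGENTS = {"marketing-1", "hr-1"}
ALLOWED_ACTIONS = {"marketing-1": {"read_campaigns"}, "hr-1": {"read_policies"}}
INJECTION_MARKERS = ("ignore previous instructions",)


def authenticate(req: Request) -> None:
    if req.agent_id not in REGISTERED_AGENTS:
        raise Rejected("unknown agent")


def enforce_policy(req: Request) -> None:
    if req.action not in ALLOWED_ACTIONS.get(req.agent_id, set()):
        raise Rejected("action not permitted")


def select_provider(req: Request) -> None:
    # Crude complexity proxy: long prompts go to the larger model.
    req.provider = "frontier" if len(req.prompt) > 200 else "efficient"


def filter_content(req: Request) -> None:
    if any(marker in req.prompt.lower() for marker in INJECTION_MARKERS):
        raise Rejected("possible prompt injection")


STAGES = [authenticate, enforce_policy, select_provider, filter_content]


def process(req: Request) -> Request:
    """Run every stage in order, recording each outcome in the audit log."""
    for stage in STAGES:
        try:
            stage(req)
        except Rejected as exc:
            req.audit_log.append(f"{stage.__name__}: rejected ({exc})")
            raise
        req.audit_log.append(f"{stage.__name__}: ok")
    return req


ok = process(Request("marketing-1", "read_campaigns", "Summarize Q1 results"))
print(ok.provider, ok.audit_log)
```

Rejections are logged before the exception propagates, so denied requests leave the same kind of record as successful ones.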

Provider Abstraction

Agents should not know or care which model provider serves their requests. The routing layer abstracts multiple providers behind a unified API. Without any code changes in the agent, this enables cost optimization (route to the cheapest provider that meets quality requirements), automatic failover during outages, and negotiating leverage with providers.
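A minimal sketch of cheapest-eligible routing with failover follows. The Provider fields, the prices, and the quality scores are invented for illustration, and each provider is represented by a plain callable instead of a real client library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Provider:
    name: str
    cost_per_1k_tokens: float  # hypothetical pricing
    quality: int               # hypothetical 1-10 quality score
    call: Callable[[str], str]


class AllProvidersFailed(Exception):
    pass


def complete(prompt: str, providers: list[Provider], min_quality: int) -> tuple[str, str]:
    """Try eligible providers from cheapest to most expensive.

    Returns (provider name, response). Falls through to the next
    provider if a call raises, which gives basic failover.
    """
    eligible = sorted(
        (p for p in providers if p.quality >= min_quality),
        key=lambda p: p.cost_per_1k_tokens,
    )
    for provider in eligible:
        try:
            return provider.name, provider.call(prompt)
        except Exception:
            continue  # provider outage or error: try the next one
    raise AllProvidersFailed(f"no eligible provider succeeded (min_quality={min_quality})")


def _down(prompt: str) -> str:
    raise ConnectionError("simulated outage")


providers = [
    Provider("premium", 15.0, 9, lambda p: f"premium:{p}"),
    Provider("budget", 0.5, 6, _down),
    Provider("standard", 3.0, 7, lambda p: f"standard:{p}"),
]

print(complete("hi", providers, min_quality=5))
```

The agent only ever calls complete, so adding, removing, or repricing a provider is a configuration change in the routing layer.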

Key Principle

Revenue comes from infrastructure management, not direct inference. The managed service provider handles routing, monitoring, security, and support. The margin is in the management, not the compute. Granular token metering per agent, per customer, per provider enables usage-based billing and cost attribution at the department level.
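Granular metering reduces to recording one usage row per request and summing along whichever dimension the bill needs. The sketch below assumes a flat per-provider price list; the prices, customers, and agents are made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    customer: str
    agent: str
    provider: str
    tokens: int


# Hypothetical price list in dollars per 1,000 tokens.
PRICE_PER_1K = {"efficient": 0.50, "frontier": 10.00}


def bill(records: list[Usage], group_by: str) -> dict[str, float]:
    """Sum cost per group, where group_by is 'customer', 'agent', or 'provider'."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        cost = record.tokens / 1000 * PRICE_PER_1K[record.provider]
        totals[getattr(record, group_by)] += cost
    return dict(totals)


records = [
    Usage("acme", "support", "efficient", 4000),
    Usage("acme", "finance", "frontier", 2000),
    Usage("globex", "support", "efficient", 1000),
]

print(bill(records, "customer"))
print(bill(records, "agent"))
```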

5. Security as Infrastructure

AI agents with access to business systems represent a novel threat vector. They can exfiltrate data at machine speed, modify records in bulk, and impersonate authorized users through the tools they control. Traditional security monitoring — designed for human-speed actions — misses agent-speed threats entirely.

Agent security requires purpose-built behavioral analysis that compares each agent's activity against rolling statistical baselines and flags deviations that indicate a compromised or misconfigured agent.

The Behavioral Approach

Effective agent security monitors patterns, not individual actions:

Volume anomalies: An agent suddenly operating at 10x its normal throughput
Scope anomalies: An agent accessing data outside its established patterns
Temporal anomalies: Activity outside established operating windows
Escalation patterns: Gradual privilege creep in which each step looks normal but the sequence as a whole represents a threat
Coordination detection: Correlated anomalies across multiple agents suggesting coordinated activity
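The volume-anomaly case can be sketched as a rolling baseline with a z-score test. The window size, the threshold of 3.0, and the minimum sample count are illustrative assumptions, not recommended values. The other anomaly types in the list would need their own features, but the same baseline-and-deviation structure applies.

```python
import statistics
from collections import deque


class VolumeBaseline:
    """Flag request counts that deviate sharply from a rolling baseline."""

    def __init__(self, window: int = 24, z_threshold: float = 3.0,
                 min_samples: int = 5) -> None:
        self.history: deque[float] = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, count: float) -> bool:
        """Record one interval's request count; return True if anomalous."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            # Floor the deviation so a perfectly flat baseline still works.
            z = (count - mean) / max(stdev, 1.0)
            anomalous = z > self.z_threshold
        if not anomalous:
            # Keep anomalies out of the baseline so a burst cannot normalize itself.
            self.history.append(count)
        return anomalous


detector = VolumeBaseline()
flags = [detector.observe(c) for c in [100, 104, 98, 101, 97, 102, 1000]]
print(flags)
```

Excluding flagged intervals from the baseline is a deliberate choice here: it stops a sustained burst from becoming the new normal, at the cost of adapting more slowly to legitimate growth.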

Insight

Security monitoring must be bundled with every deployment, not upsold as a premium tier. An agent deployment without behavioral monitoring is a liability, not a product. Shipping agent infrastructure without security is like shipping a car without brakes — technically possible, professionally negligent.

Co-locating security monitoring with the agents it watches means detection latency is measured in milliseconds. No data leaves the customer's network for security analysis. This is critical when an agent can exfiltrate thousands of records in the time a cloud-based system processes the first alert.

6. Compliance by Design

Every component of production agent infrastructure should map to the SOC 2 Trust Services Criteria. This is not an afterthought; it is a first-class design constraint. The five categories, and the evidence each draws from the infrastructure, are:

Security: Behavioral alerts, access logs, injection detection
Availability: Uptime metrics, failover history, provider health
Processing Integrity: Request-response traces, tool execution logs
Confidentiality: Data access patterns, encryption status, credential rotation
Privacy: PII detection in agent interactions, data retention compliance
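The mapping in the list above can be kept as data and checked automatically. In this sketch the evidence source identifiers are hypothetical names for the items in the list, and coverage_gaps simply reports which of them a deployment is not yet collecting.

```python
# Evidence sources per category, following the list above.
# The identifiers are hypothetical names for those evidence types.
EVIDENCE_MAP = {
    "Security": ["behavioral_alerts", "access_logs", "injection_detection"],
    "Availability": ["uptime_metrics", "failover_history", "provider_health"],
    "Processing Integrity": ["request_response_traces", "tool_execution_logs"],
    "Confidentiality": ["data_access_patterns", "encryption_status",
                        "credential_rotation"],
    "Privacy": ["pii_detection", "data_retention"],
}


def coverage_gaps(available: set[str]) -> dict[str, list[str]]:
    """Return, per category, the evidence sources that are not yet collected."""
    gaps: dict[str, list[str]] = {}
    for category, sources in EVIDENCE_MAP.items():
        missing = [source for source in sources if source not in available]
        if missing:
            gaps[category] = missing
    return gaps


collected = {
    "behavioral_alerts", "access_logs", "injection_detection",
    "uptime_metrics", "failover_history", "provider_health",
    "request_response_traces", "tool_execution_logs",
    "data_access_patterns", "encryption_status",
    "pii_detection",
}
print(coverage_gaps(collected))
```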

When the infrastructure is designed with compliance as a constraint rather than a feature, the evidence is already organized when the auditor arrives. No scramble. No retrofit. The monitoring you need for operations is the same monitoring the auditor needs for certification.

7. Conclusion: Build for the Build

The infrastructure patterns documented here — dual-database observability, multi-stage routing, three-domain monitoring, behavioral security, and compliance-by-design — address the operational requirements that no agent framework handles.

Frameworks give agents capabilities. Infrastructure gives agents reliability, security, observability, and compliance. The market has 40,000+ options for capabilities. The market has almost no options for infrastructure.

The infrastructure layer is the defensible business. Frameworks will consolidate. The survivors will still need the same monitoring, the same security, the same observability, the same routing. Platform-agnostic infrastructure means revenue that grows regardless of which framework wins.

Build for the build. The teams deploying agents need someone managing what is underneath. That is the service. That is the business. That is the playbook.


References

AICPA. "SOC 2 Trust Services Criteria." aicpa.org/soc
McNamara, Colin. "Self-Improving Code: Enterprise Agent Architecture." selfimprovingcode.ai
Hill, Jordaaan. "The Agent Infrastructure Stack." Organized AI Papers, March 2026.
Hill, Jordaaan and McNamara, Colin. "Edge Compute Economics." Organized AI Papers, March 2026.