2. Architecture Patterns
Managed agent infrastructure must address six architectural concerns that no agent framework handles natively. Each concern maps to a distinct infrastructure layer.
[Agent Framework Layer]
Any framework: the customer's choice
|
v
[Routing + Policy Layer]
Authentication, authorization, provider selection
|
v
[Observability Layer]           [Durability Layer]
Real-time dashboards            Long-term storage
Sub-second aggregation          Compliance records
|                               |
+---------- combined -----------+
|
v
[Security Layer]
Behavioral monitoring + alerting
The key design principle is separation of concerns. The agent framework handles reasoning and tool execution. The infrastructure handles everything else: routing, caching, monitoring, security, and compliance. This separation means you can swap frameworks without rebuilding infrastructure, and upgrade infrastructure without touching the agent.
The Dual-Database Pattern
The most counterintuitive architecture decision in production agent infrastructure is running two database layers in parallel. It sounds redundant. It is essential.
Agent workloads produce two fundamentally different query patterns. Operators need real-time dashboards that aggregate millions of log entries across dozens of agents, grouped by status and response time percentile, refreshing in milliseconds. Compliance teams need durable transactional records that survive hardware failures and satisfy auditors.
No single database architecture optimizes for both. The solution is a columnar query acceleration layer for real-time analytics sitting alongside a row-oriented system of record for durability. All agent events flow into both simultaneously.
10x Query Acceleration
<200ms Dashboard Refresh
10:1 Log Compression
Key Principle
Columnar storage compresses agent log data at roughly 10:1 ratios. Aggregation queries that scan only the columns needed (timestamp, status, latency) skip the columns not needed (request body, response body, headers), reducing I/O by orders of magnitude. Do not choose one database or the other. Use both. The replication cost is negligible compared to the operational benefit.
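The dual-write idea can be illustrated with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the row-oriented system of record and a plain dictionary of lists as a toy column store; the class name `DualWriteSink` and the event fields are hypothetical, and a production deployment would use real columnar and transactional engines.

```python
import sqlite3
from collections import defaultdict


class DualWriteSink:
    """Hypothetical sink that writes each agent event to two stores at once."""

    def __init__(self):
        # Row-oriented system of record. In-memory SQLite is only a stand-in
        # for a durable transactional database.
        self.rows = sqlite3.connect(":memory:")
        self.rows.execute(
            "CREATE TABLE events "
            "(ts REAL, agent TEXT, status TEXT, latency_ms REAL, body TEXT)"
        )
        # Toy column store: one list per field, so a scan can touch only
        # the columns a query needs.
        self.columns = defaultdict(list)

    def write(self, ts, agent, status, latency_ms, body):
        # The connection context manager commits on success and rolls back
        # if the insert raises.
        with self.rows:
            self.rows.execute(
                "INSERT INTO events VALUES (?, ?, ?, ?, ?)",
                (ts, agent, status, latency_ms, body),
            )
        record = {"ts": ts, "agent": agent, "status": status,
                  "latency_ms": latency_ms, "body": body}
        for name, value in record.items():
            self.columns[name].append(value)

    def mean_latency_by_status(self):
        # Dashboard-style aggregation that reads two columns and never
        # touches the (large) body column.
        totals = defaultdict(lambda: [0.0, 0])
        pairs = zip(self.columns["status"], self.columns["latency_ms"])
        for status, latency in pairs:
            totals[status][0] += latency
            totals[status][1] += 1
        return {s: total / n for s, (total, n) in totals.items()}


sink = DualWriteSink()
sink.write(1.0, "agent-a", "ok", 120.0, "{...}")
sink.write(2.0, "agent-a", "ok", 80.0, "{...}")
sink.write(3.0, "agent-b", "error", 400.0, "{...}")

dashboard = sink.mean_latency_by_status()  # served from the column store
audit_count = sink.rows.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

The aggregation reads only `status` and `latency_ms`, which is the column-skipping behavior described above, while the SQLite table keeps the full record for audit queries.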
3. The Observability Challenge
Agent observability is fundamentally different from traditional application monitoring. Agents make autonomous decisions, call tools in unpredictable sequences, and interact with external systems in ways that depend on conversational context. Standard APM tools were not designed for this.
Three-Domain Monitoring
Production agent observability requires monitoring across three domains simultaneously:
Utilization: Resource consumption across all deployment hardware. Token consumption by model provider. Queue depths for pending tasks. These metrics identify capacity constraints before they cause failures.
Faults: Error rates by agent, tool, and API endpoint. Timeout frequency. Provider availability. Security alert severity. These metrics identify active problems.
Operations: Deployment history, configuration changes, credential rotation status, patch levels. These metrics track infrastructure health distinct from the agents running on it.
Request-Level Tracing
Every agent interaction — from initial prompt to final response — must be captured as a trace with spans for each tool call, model invocation, and processing step. This trace data serves dual purposes: real-time latency analysis for operators and compliance artifacts for auditors.
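As a rough illustration of trace capture, the sketch below records nested spans with a context manager. The `Trace` and `Span` classes and the span kinds are assumptions made for this example, not the API of any particular tracing library.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    kind: str                      # e.g. "tool_call", "model_invocation"
    parent: Optional[str]          # span_id of the enclosing span, if any
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    start: float = 0.0
    end: float = 0.0

    @property
    def duration_ms(self):
        return (self.end - self.start) * 1000


class Trace:
    """Hypothetical in-process trace: one per agent interaction."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []            # finished spans, in completion order
        self._stack = []           # currently open spans

    @contextmanager
    def span(self, name, kind):
        parent = self._stack[-1].span_id if self._stack else None
        current = Span(name=name, kind=kind, parent=parent,
                       start=time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            # Close the span even if the wrapped step raised.
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)


trace = Trace()
with trace.span("handle_prompt", "request") as root:
    with trace.span("plan", "model_invocation"):
        pass                       # model call would go here
    with trace.span("search_crm", "tool_call"):
        pass                       # tool call would go here
```

Each finished span carries its parent's identifier and a duration, so the same records can feed latency analysis and be retained as an audit trail.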
Insight
Auto-generated architecture diagrams from production traces are not just documentation — they are audit artifacts. SOC 2 reviewers need data flow diagrams. Due diligence teams need system architecture. Generating these from live traces satisfies both audiences with zero manual documentation effort.
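One way to picture diagram generation from traces is sketched below. It assumes each observed span has already been reduced to a (caller, callee) pair and emits Graphviz DOT text; the component names are invented for illustration.

```python
from collections import Counter


def edges_from_spans(spans):
    """Count caller -> callee edges; each span is a (caller, callee) pair."""
    return Counter(spans)


def to_dot(edges):
    """Render the edge counts as Graphviz DOT text."""
    lines = ["digraph dataflow {"]
    for (src, dst), calls in sorted(edges.items()):
        lines.append(f'  "{src}" -> "{dst}" [label="{calls}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical spans observed in production traces.
observed = [
    ("support-agent", "router"),
    ("router", "model-provider"),
    ("support-agent", "router"),
    ("support-agent", "crm-tool"),
]
dot = to_dot(edges_from_spans(observed))
```

Because the graph is derived from observed calls rather than drawn by hand, it reflects the data flows that actually occurred during the sampled period.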
4. Intelligent Routing
The routing layer is the control plane of managed agent infrastructure. Every request from every agent passes through it, making it the natural enforcement point for security, cost management, and observability.
Multi-Stage Request Processing
Each incoming request passes through a pipeline:
- Authentication: Verify the request comes from a registered agent on authorized hardware.
- Policy enforcement: Confirm the action is permitted by configuration. A marketing agent cannot access financial data. An HR agent cannot modify production systems.
- Provider selection: Route to the optimal model provider based on task complexity, cost, and availability. Simple tasks go to efficient models. Complex reasoning goes to frontier models.
- Content filtering: Scan for prompt injection attempts before the request reaches the model.
- Response capture: Log the complete request-response pair for monitoring and compliance.
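The stages above can be sketched as a chain of small functions. Everything named here (the agent registry, the allowed-action table, the injection marker list, the provider labels) is a hypothetical placeholder; logging rejected requests alongside successful ones is also an assumption of this sketch.

```python
from dataclasses import dataclass


class Rejected(Exception):
    """Raised by any stage that refuses the request."""


@dataclass
class Request:
    agent_id: str
    action: str
    prompt: str
    complexity: str = "simple"
    provider: str = ""


# Hypothetical configuration for illustration only.
REGISTERED_AGENTS = {"marketing-agent", "hr-agent"}
ALLOWED_ACTIONS = {
    "marketing-agent": {"read_campaigns"},
    "hr-agent": {"read_policies"},
}
INJECTION_MARKERS = ("ignore previous instructions",)
AUDIT_LOG = []


def authenticate(req):
    if req.agent_id not in REGISTERED_AGENTS:
        raise Rejected("unknown agent")


def enforce_policy(req):
    if req.action not in ALLOWED_ACTIONS.get(req.agent_id, set()):
        raise Rejected("action not permitted")


def select_provider(req):
    # Simple tasks go to an efficient model, complex ones to a frontier model.
    req.provider = "frontier-model" if req.complexity == "complex" else "efficient-model"


def filter_content(req):
    # Naive substring check; real injection detection is far more involved.
    if any(marker in req.prompt.lower() for marker in INJECTION_MARKERS):
        raise Rejected("possible prompt injection")


PIPELINE = (authenticate, enforce_policy, select_provider, filter_content)


def handle(req, call_model):
    """Run every stage in order, call the model, then capture the exchange."""
    try:
        for stage in PIPELINE:
            stage(req)
        response, outcome = call_model(req.provider, req.prompt), "ok"
    except Rejected as exc:
        response, outcome = None, f"rejected: {exc}"
    AUDIT_LOG.append({"agent": req.agent_id, "action": req.action,
                      "provider": req.provider, "prompt": req.prompt,
                      "response": response, "outcome": outcome})
    return response


def echo_model(provider, prompt):
    # Stand-in for a real provider call.
    return f"{provider}:{prompt}"


reply = handle(Request("marketing-agent", "read_campaigns", "List Q3 campaigns"), echo_model)
```

Keeping each stage as a separate function means a stage can be reordered, replaced, or tested in isolation without touching the others.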
Provider Abstraction
Agents should not know or care which model provider serves their requests. The routing layer abstracts multiple providers behind a unified API. This enables cost optimization (route to the cheapest provider that meets quality requirements), automatic failover during outages, and negotiating leverage with providers, all without code changes in the agents.
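A minimal sketch of provider abstraction follows, assuming each provider exposes a single `complete` callable and carries an illustrative cost and quality score. The `Router` class, the scores, and the provider names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


class ProviderUnavailable(Exception):
    """Raised when a provider cannot serve the request."""


@dataclass
class Provider:
    name: str
    cost_per_1k_tokens: float      # illustrative price, not a real quote
    quality: int                   # hypothetical 1-10 quality score
    complete: Callable[[str], str]


class Router:
    """Single entry point that hides which provider serves a request."""

    def __init__(self, providers):
        self.providers = list(providers)

    def complete(self, prompt, min_quality):
        # Cheapest first among providers that meet the quality floor.
        candidates = sorted(
            (p for p in self.providers if p.quality >= min_quality),
            key=lambda p: p.cost_per_1k_tokens,
        )
        for provider in candidates:
            try:
                return provider.name, provider.complete(prompt)
            except ProviderUnavailable:
                continue           # fail over to the next cheapest option
        raise ProviderUnavailable("no eligible provider is reachable")


def outage(prompt):
    raise ProviderUnavailable("simulated outage")


router = Router([
    Provider("budget", 0.2, 5, outage),
    Provider("mid", 0.5, 7, lambda p: f"mid says: {p}"),
    Provider("frontier", 3.0, 9, lambda p: f"frontier says: {p}"),
])
served_by, text = router.complete("summarize this ticket", min_quality=5)
```

In this example the cheapest eligible provider is down, so the call falls through to the next one; the agent only ever sees the `Router.complete` interface.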
Key Principle
Revenue comes from infrastructure management, not direct inference. The managed service provider handles routing, monitoring, security, and support. The margin is in the management, not the compute. Granular token metering per agent, per customer, per provider enables usage-based billing and cost attribution at the department level.
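Granular metering can be sketched as a counter keyed by every billing dimension. The `TokenMeter` class and the dimension names below are assumptions for illustration; pricing and invoicing are out of scope.

```python
from collections import defaultdict

DIMENSIONS = ("customer", "department", "agent", "provider")


class TokenMeter:
    """Hypothetical meter that accumulates token counts per dimension tuple."""

    def __init__(self):
        # (customer, department, agent, provider) -> total tokens
        self.usage = defaultdict(int)

    def record(self, customer, department, agent, provider, tokens):
        self.usage[(customer, department, agent, provider)] += tokens

    def total_by(self, *dims):
        """Roll usage up to any subset of dimensions, e.g. total_by("department")."""
        positions = [DIMENSIONS.index(d) for d in dims]
        rollup = defaultdict(int)
        for key, tokens in self.usage.items():
            rollup[tuple(key[i] for i in positions)] += tokens
        return dict(rollup)


meter = TokenMeter()
meter.record("acme", "marketing", "copy-agent", "provider-a", 1000)
meter.record("acme", "marketing", "copy-agent", "provider-b", 500)
meter.record("acme", "hr", "policy-agent", "provider-a", 300)

by_department = meter.total_by("department")
by_customer_provider = meter.total_by("customer", "provider")
```

Storing usage at the finest grain and rolling up on demand lets the same data serve customer invoices and internal department-level cost attribution.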
5. Security as Infrastructure
AI agents with access to business systems represent a novel threat vector. They can exfiltrate data at machine speed, modify records in bulk, and impersonate authorized users through the tools they control. Traditional security monitoring — designed for human-speed actions — misses agent-speed threats entirely.
Agent security requires purpose-built behavioral analysis that compares each agent's activity against rolling statistical baselines and flags deviations that indicate a compromised or misconfigured agent.
The Behavioral Approach
Effective agent security monitors patterns, not individual actions:
- Volume anomalies: An agent suddenly operating at 10x its normal throughput
- Scope anomalies: An agent accessing data outside its established patterns
- Temporal anomalies: Activity outside established operating windows
- Escalation patterns: Gradual privilege creep that individually looks normal but collectively represents a threat
- Coordination detection: Correlated anomalies across multiple agents suggesting coordinated activity
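Of the patterns above, volume anomalies are the simplest to sketch. The example below keeps a rolling window of per-interval request counts and flags a count whose z-score against that window exceeds a threshold. The window size, threshold, minimum sample count, and standard-deviation floor are all illustrative assumptions, as is the choice to keep flagged counts out of the baseline.

```python
from collections import deque
from statistics import mean, stdev


class VolumeBaseline:
    """Hypothetical rolling baseline for one agent's request volume."""

    def __init__(self, window=60, threshold=3.0, min_samples=10):
        self.history = deque(maxlen=window)   # most recent normal counts
        self.threshold = threshold            # z-score that triggers a flag
        self.min_samples = min_samples        # warm-up before any flagging

    def observe(self, count):
        """Return True if this interval's count is anomalously high."""
        anomalous = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            # Floor the deviation so a perfectly flat baseline does not
            # turn every small increase into an alert (assumed floor of 1.0).
            sigma = max(stdev(self.history), 1.0)
            anomalous = (count - mu) / sigma > self.threshold
        if not anomalous:
            # Only normal observations update the baseline, so a sustained
            # attack cannot gradually teach the detector to accept it.
            self.history.append(count)
        return anomalous


baseline = VolumeBaseline()
normal_traffic = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
warmup_flags = [baseline.observe(c) for c in normal_traffic]
spike_flagged = baseline.observe(1000)    # roughly 10x normal throughput
normal_flagged = baseline.observe(101)
```

Scope, temporal, escalation, and coordination patterns would need additional features beyond a single count, but the same compare-against-rolling-baseline structure applies.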
Insight
Security monitoring must be bundled with every deployment, not upsold as a premium tier. An agent deployment without behavioral monitoring is a liability, not a product. Shipping agent infrastructure without security is like shipping a car without brakes — technically possible, professionally negligent.
Co-locating security monitoring with the agents it watches means detection latency is measured in milliseconds. No data leaves the customer's network for security analysis. This is critical when an agent can exfiltrate thousands of records in the time a cloud-based system processes the first alert.
6. Compliance by Design
Every component of production agent infrastructure should map to established trust service criteria. This is not an afterthought — it is a first-class design constraint.
- Security: Behavioral alerts, access logs, injection detection
- Availability: Uptime metrics, failover history, provider health
- Processing Integrity: Request-response traces, tool execution logs
- Confidentiality: Data access patterns, encryption status, credential rotation
- Privacy: PII detection in agent interactions, data retention compliance
When the infrastructure is designed with compliance as a constraint rather than a feature, the evidence is already organized when the auditor arrives. No scramble. No retrofit. The monitoring you need for operations is the same monitoring the auditor needs for certification.
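The mapping from operational telemetry to trust service criteria can be sketched as a lookup table. The criteria names come from the list above; the event `kind` labels and the `organize_evidence` helper are hypothetical names chosen for this example.

```python
# Criterion -> kinds of operational events that count as evidence for it.
# The kind labels are illustrative placeholders.
EVIDENCE_MAP = {
    "Security": {"behavioral_alert", "access_log", "injection_detection"},
    "Availability": {"uptime_metric", "failover_event", "provider_health"},
    "Processing Integrity": {"request_trace", "tool_execution_log"},
    "Confidentiality": {"data_access_pattern", "encryption_status",
                        "credential_rotation"},
    "Privacy": {"pii_detection", "retention_check"},
}


def organize_evidence(events):
    """Group already-collected monitoring events by the criterion they support."""
    by_criterion = {criterion: [] for criterion in EVIDENCE_MAP}
    for event in events:
        for criterion, kinds in EVIDENCE_MAP.items():
            if event["kind"] in kinds:
                by_criterion[criterion].append(event)
    return by_criterion


events = [
    {"kind": "request_trace", "id": 1},
    {"kind": "behavioral_alert", "id": 2},
    {"kind": "failover_event", "id": 3},
    {"kind": "pii_detection", "id": 4},
]
evidence = organize_evidence(events)
```

No new data is collected for the audit here: the function only re-indexes events that the monitoring stack already produces.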
7. Conclusion: Build for the Build
The infrastructure patterns documented here — dual-database observability, multi-stage routing, three-domain monitoring, behavioral security, and compliance-by-design — address the operational requirements that no agent framework handles.
Frameworks give agents capabilities. Infrastructure gives agents reliability, security, observability, and compliance. The market has 40,000+ options for capabilities. The market has almost no options for infrastructure.
The infrastructure layer is the defensible business. Frameworks will consolidate. The survivors will still need the same monitoring, the same security, the same observability, the same routing. Platform-agnostic infrastructure means revenue that grows regardless of which framework wins.
Build for the build. The teams deploying agents need someone managing what is underneath. That is the service. That is the business. That is the playbook.