The Most Consequential Architecture Decision of the Year
I’m making the build vs. buy decision for our agent infrastructure stack right now, and it’s the most consequential architecture decision I’ll make this year. Get it wrong, and we either waste 6 months building something a vendor does better, or we lock ourselves into a platform that doesn’t meet our needs 12 months from now.
After evaluating 8 vendors, talking to 12 CTOs who’ve already made this decision, and running 3 internal POCs, I want to share the framework I’m using.
The Decision Matrix
I’ve broken the agent infrastructure stack into 6 components and evaluated each on the build vs. buy spectrum:
| Component | Build Option | Buy Option | Our Decision | Reasoning |
|---|---|---|---|---|
| Agent orchestration (framework, tool calling, planning) | Open-source framework + customization | Commercial agent platform | Build (on OSS) | Core differentiator. Our agent logic is our competitive advantage. |
| Observability (tracing, monitoring, alerting) | Custom dashboards on existing monitoring | Agent-specific observability platform | Buy | Not our core competency. The specialized tooling is worth the cost. |
| Evaluation (testing, scoring, benchmarking) | Internal eval framework | Evaluation platform/service | Build | Too domain-specific to buy off-the-shelf. Our eval criteria are unique to our business. |
| Security & guardrails (permissions, audit, safety) | Extend existing security infrastructure | Agent security platform | Hybrid | Core security from existing stack, agent-specific guardrails from vendor. |
| Connectors (integrations to enterprise systems) | Custom integrations | Connector platform | Buy | Undifferentiated heavy lifting. Don’t build Salesforce connectors from scratch. |
| Inference (model serving, routing, cost optimization) | Self-hosted models + API providers | Inference gateway/router | Buy | The inference layer is commoditizing. Let someone else optimize the cost/latency tradeoff. |
The Framework: Four Questions to Ask
For each component, I ask four questions:
1. Is this a competitive differentiator?
If the component directly impacts your product’s unique value proposition, build it. You need to own the thing that makes you different. For most companies, that’s the agent logic and the evaluation criteria — because those encode your domain knowledge.
If the component is commodity infrastructure that every company needs the same way, buy it. Observability, connectors, and inference are commodities. You don’t gain competitive advantage from building a better tracing system.
2. Does the vendor ecosystem have mature options?
For agent orchestration: Yes. LangChain, CrewAI, AutoGen, and Semantic Kernel are all viable open-source options. You don’t need to build from zero.
For observability: Getting there. Sentrial, Langfuse, LangSmith, and Arize are all in-market. None is fully mature, but they’re further along than what you’d build in-house.
For evaluation: No. The tools are too generic. Every team I’ve talked to ends up building custom eval regardless of what they bought.
For security: Partially. Cascade and others address specific concerns, but comprehensive agent security platforms don’t exist yet. You’ll need to combine vendor solutions with internal infrastructure.
3. What’s the migration cost if the vendor fails or pivots?
This is where Carlos’s vendor risk analysis is critical. For each vendor component:
- How tightly coupled is the integration?
- Is there a standard data format for the component’s output? (OpenTelemetry for traces, for example)
- What’s the estimated effort to swap vendors?
- Is there an open-source fallback?
I won’t buy any agent infrastructure component where the migration cost exceeds 2 engineering-months. That’s my red line.
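A minimal sketch of how the four migration questions and the 2 engineering-month red line could be recorded per vendor component. The `VendorRisk` schema, the field names, and the sample numbers are hypothetical illustrations, not figures from my actual evaluation.

```python
from dataclasses import dataclass

# Red line from the text: never buy a component whose swap cost
# exceeds 2 engineering-months.
MAX_MIGRATION_ENGINEER_MONTHS = 2.0


@dataclass(frozen=True)
class VendorRisk:
    """Answers to the four migration questions (hypothetical schema)."""
    component: str
    tightly_coupled: bool         # How tightly coupled is the integration?
    standard_output_format: bool  # e.g. traces exportable as OpenTelemetry
    swap_effort_months: float     # Estimated engineer-months to swap vendors
    oss_fallback: bool            # Is there an open-source fallback?


def passes_red_line(risk: VendorRisk) -> bool:
    """True when the estimated swap effort stays within the migration budget."""
    return risk.swap_effort_months <= MAX_MIGRATION_ENGINEER_MONTHS


def risk_flags(risk: VendorRisk) -> list[str]:
    """Collect the qualitative warning signs worth raising in a review."""
    flags = []
    if risk.tightly_coupled:
        flags.append("tight coupling")
    if not risk.standard_output_format:
        flags.append("no standard data format")
    if not risk.oss_fallback:
        flags.append("no open-source fallback")
    return flags


# Made-up example values, for illustration only.
candidate = VendorRisk("observability", False, True, 1.5, True)
locked_in = VendorRisk("connectors", True, False, 3.0, False)

for risk in (candidate, locked_in):
    verdict = "ok to buy" if passes_red_line(risk) else "over the red line"
    print(risk.component, verdict, risk_flags(risk))
```

The hard gate is only the swap-effort estimate; the other three answers are surfaced as flags because they are usually what drives that estimate up.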
4. What’s the total cost of ownership over 3 years?
Build costs are almost always underestimated, while buy costs are more transparent up front. The real comparison is:
Build TCO = Initial development + maintenance + opportunity cost of engineers not working on product
Buy TCO = License fees + integration effort + vendor management overhead + migration risk premium
For observability, our estimate: Build = $380K over 3 years (2 engineers x 50% time). Buy = $120K over 3 years. Clear buy.
For evaluation, our estimate: Build = $200K over 3 years (1 engineer x 60% time + compute). Buy = $90K + $150K customization = $240K. Close to neutral on cost (build comes out slightly cheaper), and the build version is more fit-for-purpose.
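The two TCO formulas and the estimates above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function and parameter names are mine, and I am treating the $90K as license fees and the $150K customization as integration effort. The build figures are used as the 3-year totals given above, since no per-term breakdown is stated.

```python
def build_tco(initial_development: float, maintenance: float,
              opportunity_cost: float) -> float:
    """Build TCO = initial development + maintenance + opportunity cost."""
    return initial_development + maintenance + opportunity_cost


def buy_tco(license_fees: float, integration_effort: float,
            vendor_management: float = 0.0,
            migration_risk_premium: float = 0.0) -> float:
    """Buy TCO = license + integration + vendor management + migration premium."""
    return (license_fees + integration_effort
            + vendor_management + migration_risk_premium)


# 3-year totals as stated in the text (USD).
OBSERVABILITY_BUILD = 380_000
OBSERVABILITY_BUY = 120_000
EVALUATION_BUILD = 200_000

# Assumption: $90K maps to license fees, $150K customization to integration.
evaluation_buy = buy_tco(license_fees=90_000, integration_effort=150_000)

observability_gap = OBSERVABILITY_BUILD - OBSERVABILITY_BUY  # build minus buy
evaluation_gap = evaluation_buy - EVALUATION_BUILD           # buy minus build

print(f"Observability: buying saves ${observability_gap:,}")
print(f"Evaluation: buy totals ${evaluation_buy:,.0f}, "
      f"${evaluation_gap:,.0f} more than building")
```

Vendor management overhead and the migration risk premium default to zero here only because the estimates above do not break them out; in a real comparison they belong in the buy column.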
The Composability Requirement
The most important architectural principle: the stack must be composable. Each component should be replaceable independently. This means:
- Standardized interfaces between components (not proprietary APIs)
- Data portability at every layer (traces, evaluations, audit logs should be exportable in open formats)
- No vendor lock-in at the orchestration layer (the agent framework should be agnostic to the observability, security, and inference choices)
I’m designing the architecture as a set of loosely coupled services with well-defined contracts between them. If a vendor disappoints, I can swap that layer without rewriting the rest of the stack.
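To make "well-defined contracts" concrete, here is a minimal sketch using Python's `typing.Protocol`. Every name in it (`TraceSink`, `InferenceClient`, `Agent`, and the two stand-in implementations) is hypothetical. In practice each contract would wrap a real vendor SDK or an open format rather than an in-memory list.

```python
from dataclasses import dataclass, field
from typing import Protocol


class TraceSink(Protocol):
    """Contract for the observability layer: accept one trace record."""
    def export(self, record: dict) -> None: ...


class InferenceClient(Protocol):
    """Contract for the inference layer: turn a prompt into a completion."""
    def complete(self, prompt: str) -> str: ...


@dataclass
class InMemoryTraceSink:
    """Stand-in for a vendor or OSS tracing backend."""
    records: list = field(default_factory=list)

    def export(self, record: dict) -> None:
        self.records.append(record)


class EchoInference:
    """Stand-in for an inference gateway; simply echoes the prompt."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


@dataclass
class Agent:
    """Orchestration layer: depends only on the contracts, not on vendors."""
    inference: InferenceClient
    traces: TraceSink

    def run(self, task: str) -> str:
        answer = self.inference.complete(task)
        self.traces.export({"task": task, "answer": answer})
        return answer


sink = InMemoryTraceSink()
agent = Agent(inference=EchoInference(), traces=sink)
result = agent.run("summarize ticket")
print(result, sink.records)
```

Because `Agent` only sees the two protocols, replacing the observability or inference vendor means writing one new adapter class that satisfies the same contract; the orchestration code does not change.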
What I’m Watching
The consolidation play. One or more of the current agent infrastructure vendors will try to become the “Salesforce of agents” — a full platform that does everything. When that happens, the build vs. buy calculus changes because platform integration reduces operational overhead.
Open-source convergence. If the open-source community standardizes on agent observability formats (similar to how OpenTelemetry standardized distributed tracing), the buy case for observability weakens and the build case strengthens.
Hyperscaler entry. When AWS, Azure, and GCP ship native agent infrastructure services (and they will), the independent vendor market gets squeezed. Timing that entry and evaluating whether to wait is part of the decision.
For other CTOs and technical leaders: what’s your build vs. buy split for agent infrastructure? Are you betting on independent vendors, hyperscalers, or rolling your own?