12-Factor Agents: A Framework for Building AI Systems That Actually Ship
The teams actually shipping reliable AI agents to production customers are mostly not using agent frameworks. They rolled their own.
That observation, surfaced from conversations with 100+ technical founders, is the uncomfortable starting point for the 12-Factor Agents framework — a manifesto for building LLM-powered software that reaches production instead of languishing at 80% quality forever. The framework is named deliberately after the original 12-Factor App methodology that shaped a generation of web services. The analogy holds: just as the 12-factor app gave teams a principled approach to building deployable web services, 12-factor agents provides the principles for building reliable, observable AI systems.
The 19,000-star GitHub repository documents what the best-performing production teams figured out independently. Here is what they know.
The 80% Ceiling Is Not Random
Before getting into the factors themselves, it is worth understanding the failure pattern the framework is designed to break. It goes like this:
- Decide to build an agent
- Grab a framework to move fast — LangChain, CrewAI, AutoGen
- Reach a 70-80% quality bar quickly
- Realize 80% is not good enough for customer-facing features
- Realize that getting past 80% requires reverse-engineering the framework's prompts, state management, and control flow
- Start over from scratch
The ceiling is not a coincidence. It corresponds exactly to where framework abstractions start hiding behavior you need direct control over. The framework's system prompt is optimized for the average case. Its state management makes assumptions that break for your specific workflow. Its control loop has no hook for human approval between tool selection and tool invocation — precisely the moment when the most dangerous actions happen.
Gartner estimates 85% of AI projects fail to reach production. Enterprise analysis puts AI agent-specific failure rates at 88%. Even best-in-class agent solutions achieve below 55% goal completion on real CRM tasks. These numbers are not about model capability — Claude and GPT-4 are technically impressive. They are about the gap between a demo and a production system.
The 12 factors address that gap directly.
The Core Insight: Agents Are Mostly Just Software
Before diving into each factor, one thesis to internalize: the agents that work in production are not what you picture when you hear "AI agent." They are not an LLM sitting in a loop with a bag of tools, autonomously deciding what to do next. They are mostly deterministic code, with LLM steps inserted at precisely the right points to handle the parts that require language understanding or reasoning.
The LLM is used as a structured transformation function: natural language in, typed data structure out. Everything else — routing, retries, state management, human handoffs — is regular software. This is not a limitation. It is the design pattern that makes the system observable, testable, and shippable.
The 12 Factors
1. Natural Language to Tool Calls
The most reliable agent pattern: translate user input into a well-typed tool call, then execute that tool call with deterministic code. "Create a payment link for $750 to Terri" becomes CreatePaymentLink(amount=750, recipient="Terri"). The LLM handles intent understanding. The function handles execution.
This establishes a clean interface boundary. The LLM's job is to produce a valid structured output. The code's job is to dispatch on it. Testing that interface is straightforward — unit tests for the dispatch logic, evals for the LLM's extraction accuracy.
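As a minimal sketch of that boundary — the tool name, schema, and payment-link URL format here are all hypothetical, and the LLM call is replaced by its literal JSON output:

```python
from dataclasses import dataclass

# Hypothetical typed tool call: the LLM's only job is to emit this structure.
@dataclass
class CreatePaymentLink:
    amount: float
    recipient: str

def parse_tool_call(llm_json: dict) -> CreatePaymentLink:
    """Validate the model's structured output before anything executes."""
    if llm_json.get("tool") != "create_payment_link":
        raise ValueError(f"unknown tool: {llm_json.get('tool')}")
    args = llm_json["arguments"]
    return CreatePaymentLink(amount=float(args["amount"]), recipient=str(args["recipient"]))

def execute(call: CreatePaymentLink) -> str:
    # Deterministic code owns execution; a real payments API would go here.
    return f"https://pay.example.com/{call.recipient}/{call.amount:.2f}"

# "Create a payment link for $750 to Terri" -> the model emits:
raw = {"tool": "create_payment_link", "arguments": {"amount": 750, "recipient": "Terri"}}
link = execute(parse_tool_call(raw))
```

The dispatch logic is now a plain function you can unit-test without any model in the loop.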
2. Own Your Prompts
Do not outsource prompt engineering to a framework. Some frameworks take a black-box approach: Agent(role="customer support", goal="resolve tickets", tools=[...]). This gives you a reasonable starting prompt but makes it opaque. You cannot audit what exact tokens go into the model, you cannot write regression tests for prompt behavior, and you cannot tune it without reverse-engineering the framework internals.
Treat prompts as first-class code. Version control them. Write evals for them. Own every token. This is more work upfront but pays dividends when production issues appear — and they will.
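"Prompts as first-class code" can be as simple as a versioned template constant — the triage task and template below are invented for illustration:

```python
# Prompt lives in version control, not inside a framework abstraction.
# Bump the name (V2 -> V3) when behavior changes, so evals can pin a version.
TRIAGE_PROMPT_V2 = """\
You are a support-ticket triage assistant.
Classify the ticket into one of: billing, bug, feature_request.
Respond with JSON: {{"category": "..."}}.

Ticket: {ticket}
"""

def render_triage_prompt(ticket: str) -> str:
    """Every token that reaches the model passes through this one function."""
    return TRIAGE_PROMPT_V2.format(ticket=ticket)
```

Because rendering is a pure function, regression tests on prompt content are ordinary unit tests.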
3. Own Your Context Window
At any given point in an agent loop, you are sending the model: system instructions, retrieved documents, past tool calls and results, memory from related sessions, and output format instructions. This is "context engineering" — the discipline that emerged in 2025 and is now considered as important as prompt engineering.
The factor argues you should not be locked into the standard message array format. Custom context structures — often XML or purpose-built formats packed into a single user message — can be more token-efficient and produce better reasoning quality for your specific use case. IBM documented a workflow that consumed 20 million tokens and failed repeatedly; the same workflow with compressed memory pointers used 1,234 tokens and succeeded. Context engineering was the only change.
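A sketch of what a purpose-built context format might look like — the XML-ish tags here are an illustrative choice, not a standard:

```python
def pack_context(system: str, docs: list[str], events: list[dict]) -> str:
    """Pack everything into one purpose-built block instead of the
    standard message array. The tag names are arbitrary; what matters
    is that you control every token."""
    doc_block = "\n".join(f"<doc>{d}</doc>" for d in docs)
    event_block = "\n".join(
        f"<event type={e['type']!r}>{e['data']}</event>" for e in events
    )
    return (
        f"<instructions>{system}</instructions>\n"
        f"<documents>\n{doc_block}\n</documents>\n"
        f"<history>\n{event_block}\n</history>"
    )
```

This whole string becomes a single user message, which makes token accounting and compaction decisions entirely yours.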
4. Tools Are Just Structured Outputs
Tool calling is not magic. It is the model producing a JSON object that matches a schema. Your code dispatches on it. This demystification matters practically: you can use plain structured outputs even when a framework's native tool calling is misbehaving; you can add validation before dispatch; you can unit-test the dispatch logic independently of the LLM.
When you see tool calls as structured outputs, you can also build custom dispatch that does things frameworks do not: rate-limiting specific tools, adding human approval gates for particular call signatures, or routing the same tool call to different implementations based on environment.
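A minimal sketch of such a custom dispatcher — the two tools and the approval flag are hypothetical stand-ins:

```python
# Stand-ins for real implementations.
def query_db(sql: str) -> str:
    return f"rows for: {sql}"

def delete_record(record_id: str) -> str:
    return f"deleted {record_id}"

# Dispatch table: tool name -> (handler, needs_human_approval).
TOOLS = {
    "query_db": (query_db, False),
    "delete_record": (delete_record, True),  # gate destructive calls
}

def dispatch(call: dict, approved: bool = False) -> dict:
    """Because a tool call is just data, we can inspect it before running it."""
    handler, needs_approval = TOOLS[call["tool"]]
    if needs_approval and not approved:
        # Surface to a human instead of executing (see Factor 7).
        return {"status": "awaiting_approval", "call": call}
    return {"status": "done", "result": handler(**call["arguments"])}
```

Rate limits, environment-based routing, or per-tool validation all slot into the same few lines before `handler(...)` runs.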
5. Unify Execution State and Business State
Many agent systems maintain two state stores: execution state (which step are we on, retry count, waiting status) and business state (conversation history, tool call results). State divergence between these two is a common source of production bugs — the agent believes it has completed step 3, but the database says step 2. Recovery is a nightmare.
The insight: execution state can almost always be inferred from the event log of what has happened. Make your event log the single source of truth. The "state" passed to each LLM call is just a view over that log. This makes agent threads trivially serializable, debuggable, and resumable from any point.
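A sketch of deriving execution state as a view over the log — the event types (`tool_result`, `tool_error`, `human_request`) are an assumed schema:

```python
def execution_state(events: list[dict]) -> dict:
    """Derive execution state from the event log instead of storing it
    separately. The log is the single source of truth; this is just a view."""
    step = sum(1 for e in events if e["type"] == "tool_result")
    retries = 0
    for e in reversed(events):          # count trailing consecutive errors
        if e["type"] == "tool_error":
            retries += 1
        else:
            break
    waiting = bool(events) and events[-1]["type"] == "human_request"
    return {"step": step, "retries": retries, "waiting_on_human": waiting}
```

Because the state is computed, it can never diverge from the log — the "agent says step 3, database says step 2" bug class disappears by construction.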
6. Launch, Pause, and Resume with Simple APIs
You should be able to start an agent with an API call, suspend it when waiting on a slow external operation (a human response, a long-running job, a 24-hour batch process), and resume it via a webhook — without restarting from the beginning.
The critical detail: pause must be possible between tool selection and tool invocation, not just between full agent turns. That gap — after the model decides to take a high-stakes action, before the action executes — is exactly where human approval matters most. Frameworks that do not support pausing at that point cannot safely be given write access to production systems.
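Combined with Factor 5, pause/resume reduces to serializing and reloading the event log. A minimal sketch, using an in-memory dict where a real system would use a database:

```python
import json

THREADS: dict[str, str] = {}  # stand-in for durable storage of serialized threads

def launch(thread_id: str, event: dict) -> None:
    """Start a new agent thread from an initial event."""
    THREADS[thread_id] = json.dumps([event])

def pause(thread_id: str, events: list[dict]) -> None:
    """Suspend: the whole agent state is just the log, so persist the log."""
    THREADS[thread_id] = json.dumps(events)

def resume(thread_id: str, webhook_payload: dict) -> list[dict]:
    """A webhook response wakes the thread; append it and hand back to the loop."""
    events = json.loads(THREADS[thread_id])
    events.append({"type": "webhook", "data": webhook_payload})
    return events
```

The webhook could arrive seconds or days later; nothing about the thread is held in process memory in between.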
7. Contact Humans with Tool Calls
Human-in-the-loop is not an architectural afterthought. It is a tool call. RequestHumanApproval(action="delete_customer_record", context="...", urgency="high") is structurally identical to QueryDatabase(sql="..."). The agent invokes it, the thread serializes, a Slack or email notification fires, and execution waits for the webhook response.
This makes human oversight first-class: auditable, testable, and composable with all other factors. The alternative — ad-hoc "should I ask a human here?" logic bolted onto a framework — is brittle and invisible to observability tooling.
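Structurally, the human-contact "tool" is just a function that emits an event instead of a result — a sketch with a hypothetical signature:

```python
def request_human_approval(action: str, context: str, urgency: str) -> dict:
    """Shaped like any other tool, but it yields a pending event rather than
    a result. A real implementation would fire a Slack/email notification
    here, then the thread would serialize and wait for the webhook."""
    return {
        "type": "human_request",
        "data": {"action": action, "context": context, "urgency": urgency},
    }

event = request_human_approval(
    action="delete_customer_record",
    context="user asked to purge their data",
    urgency="high",
)
```

Because it goes through the same dispatch path as every other tool, it shows up in the same logs, evals, and traces.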
8. Own Your Control Flow
Write your own agent loop. The while-loop that calls the LLM, handles tool dispatch, and appends results is short enough to fit on a screen. It is also important enough that you need to control it directly.
Custom control flow lets you: add breakpoints between tool selection and invocation, implement LLM-as-judge on structured outputs before dispatch, add context window compaction when approaching limits, instrument per-step latency and cost, implement client-side rate limiting, and add durable sleep without holding the process open.
This is consistently the most-requested feature from framework users: the ability to interrupt a running agent at a specific point and resume later. It is trivial to implement when you own the loop. It is often impossible to retrofit into a framework you do not control.
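A minimal owned loop with that breakpoint built in — `call_llm` is a stub standing in for a real model call, and the decision format is an assumed schema:

```python
def run_agent(events, call_llm, tools, needs_approval=lambda call: False):
    """Minimal owned loop. `call_llm(events)` is assumed to return either
    {"tool": ..., "arguments": ...} or {"done": ...}."""
    while True:
        decision = call_llm(events)
        if "done" in decision:
            return {"status": "done", "result": decision["done"], "events": events}
        events.append({"type": "tool_call", "data": decision})
        # The breakpoint between tool *selection* and tool *invocation*:
        if needs_approval(decision):
            return {"status": "paused_for_approval", "events": events}
        result = tools[decision["tool"]](**decision["arguments"])
        events.append({"type": "tool_result", "data": result})
```

Per-step instrumentation, compaction, or rate limiting are each one extra line inside a loop you own, rather than a feature request against a framework.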
9. Compact Errors into Context Window
When a tool call fails, do not just retry silently. Catch the exception, format it clearly, and append it to the context window before the next LLM call. LLMs are surprisingly good at reading error messages and adjusting on the next attempt — but only if they can see the error.
Additionally: implement a consecutive error counter. After three sequential failures on the same tool, break the loop and escalate to a human via Factor 7. Agents that spin on a failing tool call — burning tokens, running up API costs, accumulating in logs — are a documented production failure mode. Explicit error compaction with a ceiling turns a silent spin into a recoverable, observable event.
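A sketch of both halves — compacting the error into the log and enforcing the ceiling — with the event schema and the three-failure threshold as assumptions:

```python
def attempt_tool(events: list, tool, args: dict, max_consecutive_errors: int = 3):
    """Run a tool; on failure, append a compact error the model can read.
    After a run of consecutive failures, escalate to a human (Factor 7)."""
    consecutive = 0
    for e in reversed(events):          # count trailing consecutive errors
        if e["type"] == "tool_error":
            consecutive += 1
        else:
            break
    try:
        result = tool(**args)
        events.append({"type": "tool_result", "data": result})
    except Exception as exc:
        events.append({"type": "tool_error", "data": f"{type(exc).__name__}: {exc}"})
        if consecutive + 1 >= max_consecutive_errors:
            events.append({"type": "human_request", "data": "tool failing repeatedly"})
    return events
```

The model sees each formatted error on its next turn, and the loop can never burn tokens past the ceiling unobserved.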
10. Small, Focused Agents
Build agents that do one thing well, ideally in 3-10 steps. Databricks Mosaic research found that model correctness starts dropping after 32K tokens of context, with agents favoring repetitive actions from their growing history over correct next steps. As context grows, focus degrades.
The practical design implication: decompose complex workflows into a chain of small specialized agents rather than one monolithic agent. Each small agent is easier to evaluate, easier to debug, and easier to improve. As models get better, you can slowly expand scope rather than doing a rewrite.
11. Trigger from Anywhere, Meet Users Where They Are
If you have implemented Factor 6 (pause/resume) and Factor 7 (human contact as tool call), your agent is already channel-agnostic. Triggering from Slack, email, webhook, or cron job is just a different transport calling the same launch API. Responding back on the same channel is just routing the webhook response.
This unlocks "outer loop" agent patterns: agents that run for 5-90 minutes on a complex task, then surface at a critical decision point wherever the user happens to be — without requiring the user to stay in a chat window.
12. Make Your Agent a Stateless Reducer
The culminating pattern. An agent is a pure function: given the current event log and a new input, it returns the updated event log. This is foldl applied to agent design. Each step: (state, event) → new_state.
Stateless reducers are trivially testable. You can write a unit test by constructing an input state and asserting on the output. You can replay any production run from its event log. You can fork a thread at any point for debugging or A/B testing. The entire agent system becomes deterministic from the outside, even though individual LLM calls are not.
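The foldl shape can be sketched in a few lines — here state is simply the event log itself, an assumption consistent with Factor 5:

```python
from functools import reduce

def step(events: tuple, event: dict) -> tuple:
    """Pure reducer: (state, event) -> new_state. No mutation, no hidden state."""
    return events + (event,)

def replay(event_stream) -> tuple:
    # foldl over the stream reconstructs the thread from scratch.
    return reduce(step, event_stream, ())

log = replay([
    {"type": "user_message", "data": "refund order 7"},
    {"type": "tool_call", "data": {"tool": "refund"}},
])
```

Replaying a production incident is just calling `replay` on its logged events; forking a thread is replaying a prefix of them.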
What This Actually Looks Like in Practice
These factors are not independent checklist items. They compose. An agent that owns its context window (Factor 3) can implement error compaction (Factor 9) properly. An agent with a unified state model (Factor 5) can trivially implement pause/resume (Factor 6). A stateless reducer (Factor 12) makes human handoffs (Factor 7) clean because there is no in-memory state to preserve.
The reference implementation is a loop you can hold in your head: receive an event, construct context from the event log, call the LLM, get a structured output, validate it, dispatch it, catch any errors and append them to the log, check if the result is a human-contact tool, pause and serialize if so, otherwise loop.
That loop, with these 12 principles applied, is what the teams actually shipping agents to production have converged on independently — before anyone wrote them down.
The Framework Is Not Anti-Framework
A clarification worth making: 12-Factor Agents does not say "never use LangGraph, never use LlamaIndex." LangGraph in particular aligns well with several factors — its explicit graph structure supports owned control flow (Factor 8), and its state management supports event-log unification (Factor 5). The framework's point is about principles, not library choices.
What it does say: understand which principles your framework respects and which it violates. If you cannot see your prompts, that is Factor 2 violated. If you cannot pause between tool selection and invocation, that is Factor 6 violated. Know the contract you are signing when you adopt an abstraction.
The Signal Under the Noise
The AI agent space generates enormous amounts of benchmark noise — demo videos of impressive-looking workflows that fail at the first real user. The 12-Factor Agents framework is a different kind of signal: principles extracted from teams that shipped and maintained production systems, rather than teams that got a compelling demo working.
The 80% ceiling is real. The solution is not a better framework. It is owning the parts of the system that are too important to delegate: your prompts, your context, your control flow, and your state. The factors are not prescriptions for any particular implementation — they are the properties your implementation needs to have for production to be survivable.
Most of them are also just good software engineering applied to a new class of system. Which is exactly the point.
- https://www.humanlayer.dev/blog/12-factor-agents
- https://news.ycombinator.com/item?id=43699271
- https://www.youtube.com/watch?v=8kMaTybvDUw
- https://www.anthropic.com/engineering/building-effective-agents
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://dzone.com/articles/understanding-twelve-factor-agents
- https://www.ikangai.com/12-factor-agents-a-blueprint-for-reliable-llm-applications/
- https://www.digitalapplied.com/blog/88-percent-ai-agents-never-reach-production-failure-framework
- https://www.dbreunig.com/2025/12/06/the-state-of-agents.html
- https://12factor.net
