
When Your Agent Framework Becomes the Bug

· 8 min read
Tian Pan
Software Engineer

High-level agent frameworks promise to turn a three-day integration into a three-hour prototype. That promise is real. The problem is what happens next: six months into production, engineers at a company that builds AI-powered browser testing agents discovered they were spending as much time debugging LangChain as building features. Their fix was radical — they eliminated the framework entirely and went back to modular building blocks. "Once we removed it," they wrote, "we no longer had to translate our requirements into LangChain-appropriate solutions. We could just code."

They are not alone. Roughly 45% of developers who experiment with high-level LLM orchestration frameworks never deploy them to production. Another 23% eventually remove them after shipping. These numbers don't mean frameworks are bad tools — they mean frameworks are tools with a specific useful range, and that range is narrower than the demos suggest.

The Abstraction Tax Nobody Quotes in the README

Every framework imposes a tax. The question is whether the abstractions it provides are worth what you pay for them.

In the early phases of a project, the answer is usually yes. You get retrieval chains, memory management, tool calling, and agent loops in a few lines. The framework handles the boilerplate you don't want to write yet. Iteration is fast.

The tax becomes visible when production arrives. The framework's opaque internal machinery — hidden retry logic, automatic context windowing, behind-the-scenes prompt injection — starts behaving in ways you can no longer reason about. And because the machinery is hidden, when it goes wrong, the error surfaces in your code while the root cause lives three abstraction layers down.

Three specific categories of failure repeat across teams:

Hidden retry amplification. Many frameworks implement retry logic automatically, without token budget awareness or jitter. A naive retry on a 4,000-token prompt sends 12,000 prompt tokens across three attempts if it retries twice. In multi-step pipelines, the math compounds: three attempts at each layer of a five-service chain multiply into 3⁵ = 243 backend calls for every original user request. One team discovered this when their API spend jumped from $127/week to $47,000/week after an agent loop ran uncontrolled for eleven days. The framework's retry logic had no circuit breaker, no budget cap, and no alerting surface.
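The compounding is easy to verify in a few lines. A minimal sketch (the attempt counts are illustrative, not taken from any particular framework):

```python
# Each service attempts its downstream call up to `attempts` times,
# so backend calls for one user request grow as attempts ** depth.

def amplified_calls(attempts: int, depth: int) -> int:
    """Calls reaching the deepest layer of a chain where every layer retries."""
    return attempts ** depth

# Three attempts (one original plus two retries) per layer, five layers deep:
print(amplified_calls(3, 5))  # 243

# Token cost scales the same way: a 4,000-token prompt retried twice
# sends the prompt three times.
print(4_000 * 3)  # 12000
```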

Context window invisibility. Frameworks that manage conversation history for you tend to be generous — they keep everything. By message ten in a multi-turn conversation, you may be sending 40,000 tokens to receive a 100-token response. Teams that replaced a framework's default memory component with custom implementations routinely report 30% cost reductions with no quality degradation. The framework's default wasn't wrong, exactly — it just optimized for not losing information rather than for cost, and it never told you it was doing so.
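A custom replacement can be as simple as a sliding window over a token budget. A sketch, using a rough four-characters-per-token estimate (production code would use the provider's tokenizer):

```python
def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the most recent messages that fit within a token budget.

    Token counts are approximated as len(text) // 4; swap in the
    provider's real tokenizer before relying on the numbers.
    """
    kept: list[dict] = []
    used = 0
    for msg in reversed(messages):          # walk newest-first
        cost = max(1, len(msg["content"]) // 4)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))             # restore chronological order

# Fifty 400-character turns, trimmed to a 1,000-token budget:
history = [{"role": "user", "content": "x" * 400} for _ in range(50)]
print(len(trim_history(history, budget_tokens=1000)))  # 10
```

The point is not the ten lines of logic but that the trimming policy is now visible, testable, and yours to change.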

Debugging that requires reading the framework's source code. When LangChain's LCEL pipe operator routes execution through its internal invoke() machinery, there is no natural insertion point for standard Python logging between steps. Engineers end up adding print statements to the framework's own source code to trace what's happening. This is a sign that the abstraction has inverted: instead of making your work easier, it is making the framework's internal work visible to you at the wrong level.
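When you own the composition, adding a logging seam between steps is trivial. A sketch in plain Python, not any framework's API (the step names are invented for illustration):

```python
import logging
from functools import reduce
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def traced(name: str, step: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a pipeline step so its input and output hit standard logging."""
    def wrapper(value: str) -> str:
        log.info("%s <- %r", name, value)
        result = step(value)
        log.info("%s -> %r", name, result)
        return result
    return wrapper

def pipeline(*steps: Callable[[str], str]) -> Callable[[str], str]:
    """Compose steps left to right, like a pipe operator you can see into."""
    return lambda x: reduce(lambda acc, s: s(acc), steps, x)

run = pipeline(
    traced("normalize", str.strip),
    traced("uppercase", str.upper),
)
print(run("  hello  "))  # HELLO
```

Every intermediate value is observable with the debugging tools you already know, which is exactly what the opaque pipe denies you.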

The Signals That Tell You to Move Down the Stack

These problems don't appear all at once. They accumulate. The early warnings are easy to miss because each one looks like a configuration issue rather than an architectural one. Here's what to watch for:

Cost explodes beyond estimates. If your actual API spending is consistently 2–3× your back-of-the-envelope estimates, the framework is injecting tokens you're not accounting for. This includes system prompt templates injected automatically, tool schemas included regardless of whether the tool is relevant, and verbose internal chain descriptions added to support debugging output.

Debugging requires framework-specific knowledge. When a failure occurs and you can't diagnose it using general Python debugging skills — when you have to read the framework's source, understand its internals, or grep through its issues tracker — the abstraction is no longer helping you. You're paying the overhead cost without getting the productivity benefit.

Your requirements don't fit the framework's model. The team mentioned earlier needed to spawn sub-agents dynamically, coordinate specialist agents, and observe agent state during execution. LangChain's chain model didn't accommodate these patterns cleanly. Every new capability required translating their mental model into LangChain-appropriate abstractions, which added friction rather than removing it.

Performance doesn't improve under optimization. One benchmarking study found that CrewAI consumes nearly twice the tokens of other frameworks and takes over three times as long, because its multi-agent pattern injects verbose role descriptions, goal statements, and internal monologues for every agent invocation. If latency remains stubbornly high after obvious prompt optimization, the framework overhead may be the floor you can't optimize through.

Upgrades break things unpredictably. Frameworks that moved fast on their way to 1.0 introduced breaking changes across releases. If upgrading the framework version requires auditing changes across your entire application, you've accumulated invisible coupling that the abstraction was supposed to prevent.

What Moving Down the Stack Actually Looks Like

"Dropping to lower abstraction" is not a single action. It's a spectrum. The goal isn't to rewrite everything from scratch — it's to find the level where your team can reason clearly about what's happening and control what needs controlling.

Start with the most painful seam. If hidden retry logic is your problem, the fix is not a full framework migration — it's replacing the one component that manages retries with explicit, observable code. Most frameworks expose enough escape hatches to do this without abandoning the rest of the framework.
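An explicit replacement for hidden retry logic can be small. A sketch with capped attempts, jittered exponential backoff, and a per-request token budget (parameter values are illustrative, not recommendations):

```python
import random
import time
from typing import Callable

def call_with_retry(call: Callable[[], str], prompt_tokens: int, *,
                    max_attempts: int = 3,
                    token_budget: int = 20_000,
                    base_delay: float = 0.5) -> str:
    """Retry `call` with jittered exponential backoff, refusing to
    exceed a per-request token budget."""
    spent = 0
    for attempt in range(1, max_attempts + 1):
        if spent + prompt_tokens > token_budget:
            raise RuntimeError(f"token budget exhausted after {spent} tokens")
        spent += prompt_tokens
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise
            # Full jitter: sleep between 0 and base_delay * 2**attempt seconds.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    raise RuntimeError("unreachable")
```

Twenty lines, but every behavior — attempt count, backoff shape, budget — is now explicit and observable instead of buried in framework internals.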

Make token costs observable before changing anything. Before optimizing, instrument. You need to know, per request type, exactly how many tokens the system prompt contributes, how many come from tool schemas, how many from conversation history, and how many from the actual user content. Without this breakdown, any optimization is guesswork. OpenTelemetry-compatible LLM tracing libraries can attach token counts to spans with minimal code changes.
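The breakdown can start as a plain record before any tracing library is involved. A sketch — the category names and the four-characters-per-token estimate are assumptions, not a standard:

```python
def token_breakdown(system_prompt: str, tool_schemas: list[str],
                    history: list[str], user_content: str) -> dict[str, int]:
    """Approximate per-category token counts for one request.
    Uses len(text) // 4 as a stand-in for a real tokenizer."""
    est = lambda text: max(1, len(text) // 4)
    return {
        "system": est(system_prompt),
        "tools": sum(est(s) for s in tool_schemas),
        "history": sum(est(m) for m in history),
        "user": est(user_content),
    }

breakdown = token_breakdown(
    system_prompt="You are a helpful assistant." * 10,
    tool_schemas=['{"name": "search", "parameters": {}}'] * 3,
    history=["earlier turn " * 50] * 8,
    user_content="What changed in the last deploy?",
)
print(breakdown)
print("total:", sum(breakdown.values()))
```

Once these numbers exist per request type, attaching them as span attributes in an OpenTelemetry-compatible tracer is a mechanical step — the hard part is deciding to measure at all.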

Replace memory management with explicit state. Framework-managed conversation history is almost never the right default for production. Write your own summarization logic, or implement a sliding window, or store session state in Redis and reconstruct it deliberately. The explicit version is more code, but it's code you understand and can test.
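A sketch of explicit session state: a rolling summary plus the last few turns, backed here by an in-memory dict where production code might use Redis (all names are illustrative):

```python
import json

class SessionStore:
    """Explicit session state: a summary string plus recent turns.
    The dict stands in for Redis; swapping in get/set on a session key
    would not change any caller."""

    def __init__(self, keep_last: int = 4):
        self._db: dict[str, str] = {}
        self.keep_last = keep_last

    def save(self, session_id: str, summary: str, turns: list[dict]) -> None:
        state = {"summary": summary, "turns": turns[-self.keep_last:]}
        self._db[session_id] = json.dumps(state)

    def load(self, session_id: str) -> dict:
        raw = self._db.get(session_id)
        return json.loads(raw) if raw else {"summary": "", "turns": []}

store = SessionStore(keep_last=2)
store.save("s1", summary="User is debugging a retry loop.",
           turns=[{"role": "user", "content": f"turn {i}"} for i in range(5)])
state = store.load("s1")
print(len(state["turns"]))  # 2
```

What goes into the summary and when it is regenerated are deliberate decisions you make, not defaults you inherit.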

Build toward direct SDK calls for critical paths. The Anthropic and OpenAI SDKs are well-designed. Many patterns — prompt chaining, structured output, tool calling, streaming — can be implemented in 20–50 lines against the SDK directly. For latency-critical or cost-critical paths, this is often the right answer. You get predictable behavior, clear error surfaces, and full control over retries, timeouts, and context construction.
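A prompt chain against a bare SDK really does fit in a couple of dozen lines. Sketched here with the model call injected as a plain function so the control flow is visible; in real code that callable would wrap the Anthropic or OpenAI client, and the prompts are invented for illustration:

```python
from typing import Callable

ModelCall = Callable[[str], str]

def chain(call_model: ModelCall, document: str) -> str:
    """Two-step prompt chain: extract key points, then draft a summary.
    Explicit and sequential, so each step is trivial to log, time, or cap."""
    points = call_model(f"List the key points in this document:\n{document}")
    summary = call_model(
        f"Write a one-paragraph summary from these points:\n{points}")
    return summary

# Stub model for demonstration; a real implementation would call the SDK,
# e.g. the Anthropic or OpenAI client, inside this function.
def fake_model(prompt: str) -> str:
    return f"[model output for {len(prompt)}-char prompt]"

print(chain(fake_model, "Quarterly report text..."))
```

Because the model call is an ordinary function argument, the same chain runs against a stub in tests and the real client in production.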

The Migration That Doesn't Require Starting Over

Migrating away from a high-level framework is easier if you treat it as a seam-by-seam replacement rather than a big-bang rewrite.

The general pattern:

  1. Add observability first. Instrument the system to capture token usage, latency per step, and error surfaces before touching anything else. This tells you what's actually expensive.

  2. Replace one component at a time. Start with memory management (usually the most opaque), then retry/error handling, then prompt construction. Each replacement should be independently testable.

  3. Keep the framework in place for non-critical paths. Rapid prototyping benefits from framework abstractions. You don't need to migrate your internal admin tools and experimental features — focus on the production paths that serve users.

  4. Use typed interfaces at boundaries. Pydantic models for inputs and outputs, at every component boundary, give you a migration checkpoint. When a component is replaced, the interface stays stable.

  5. Test the behavior you're preserving, not the code. Don't write unit tests that check implementation details of the old framework. Write behavioral tests that assert what the system should do: given this user query, does the response include correct information? Does it call the expected tool? Does it stay within budget?

For teams that want a framework but a lighter one, type-safe, Python-native alternatives like Pydantic AI have emerged that support multiple providers, have built-in test injection, and don't inject tokens you didn't ask for. They're not magic — you still need to understand what's happening — but they impose a much smaller tax for the scaffolding they provide.

When Frameworks Are the Right Answer

None of this means frameworks are wrong. They are right in specific conditions:

  • When the team is exploring and speed of iteration beats operational control
  • When the framework's built-in integrations cover your actual use case without customization
  • When the cost of the framework overhead is genuinely acceptable for your traffic volume and latency requirements
  • When your team has the bandwidth to understand the framework's internals before relying on them

The mistake isn't using a framework. The mistake is keeping one past the point where it costs more to operate than to replace. The cost of staying is invisible (absorbed into debugging time, unexplained API bills, and architectural constraints that appear as slowness) while the cost of migrating is visible and up-front. That asymmetry is why teams hold on too long.

The question isn't "should I use a framework?" It's "which layer should each part of my system live at?" The answer will be different for your prototype, your production API, and your internal tooling — and it will change as your system matures. Build in the escape hatches now. You'll use them.
