Skip to main content

17 posts tagged with "llm-engineering"

View all tags

The LLM-as-Compiler Pattern: Separating Plan Generation from Execution

· 10 min read
Tian Pan
Software Engineer

When a PlanCompiler-style agent is benchmarked against direct LLM-to-code generation on 300 stratified multi-step tasks, the structured approach achieves 92.67% success at $0.00128 per task. The direct approach — where the LLM decides actions step-by-step in a free-form loop — achieves 62% success at $0.0106 per task. That is 50% more accurate at one-eighth the cost.

The difference isn't model capability. Both approaches use the same model. The difference is architecture: one separates plan generation from plan execution; the other conflates them.

Stale Tool Descriptions Are Your Agent's Biggest Silent Failure

· 9 min read
Tian Pan
Software Engineer

You ship a tool that lets your agent fetch user profiles. The description reads: "Retrieves user information by user ID." Six weeks later, the backend team renames user_id to customer_uuid and adds a required tenant_id field. Nobody updates the tool schema. Your agent keeps calling the old signature, gets back a 400, interprets the empty result as "no user found," and helpfully creates a duplicate record.

No error in the logs. No alert fired. The agent was confident the whole time.

This is the tool documentation problem: schema drift that turns stale descriptions into silent failure vectors. It is probably the most underappreciated reliability hazard in production AI systems today, and it gets worse the longer your agent lives.

The AI Feature You Should Not Have Shipped: A Task-Shape Checklist

· 10 min read
Tian Pan
Software Engineer

The demo always works. That is the most expensive sentence in AI product development. The product manager sees the model handle the happy path, the engineer ships the obvious version of the feature, and six weeks later the support queue is full of complaints that the metric did not predict. Nothing in the model regressed. Nothing in the prompt got worse. The feature was simply not the shape the model could do well, and the team did not have a way to say so before the work began.

A meaningful fraction of shipped AI features fail this way — not because the model is bad, but because the task is wrong. The output the product needs is deterministic and the engine is stochastic. The user's tolerance for the tail is one bad answer per thousand and the model's failure distribution is heavier than that. The latency budget the unit economics require is half of what the model can deliver at any tier you can afford. The ground truth required to evaluate quality does not exist and cannot be cheaply created. None of these are model problems. They are task-shape problems, and they should have been screened before the first prompt was written.

Context Windows Aren't Free Storage: The Case for Explicit Eviction Policies

· 10 min read
Tian Pan
Software Engineer

Most engineering teams treat the LLM context window the way early web developers treated global variables: throw everything in, fix it later. The context is full of the last 40 conversation turns, three entire files from the repository, a dozen retrieved documents, and a system prompt that's grown by committee over six months. It works — until it doesn't, and by then it's hard to tell what's causing the degradation.

The context window is not heap memory. It is closer to a CPU register file: finite, expensive per unit, and its contents directly affect every computation the model performs. When you treat registers as scratch space and forget to manage them, programs crash in creative ways. When you treat context windows as scratch space, LLMs degrade silently and expensively.

When Your Agent Framework Becomes the Bug

· 8 min read
Tian Pan
Software Engineer

High-level agent frameworks promise to turn a three-day integration into a three-hour prototype. That promise is real. The problem is what happens next: six months into production, engineers at a company that builds AI-powered browser testing agents discovered they were spending as much time debugging LangChain as building features. Their fix was radical — they eliminated the framework entirely and went back to modular building blocks. "Once we removed it," they wrote, "we no longer had to translate our requirements into LangChain-appropriate solutions. We could just code."

They are not alone. Roughly 45% of developers who experiment with high-level LLM orchestration frameworks never deploy them to production. Another 23% eventually remove them after shipping. These numbers don't mean frameworks are bad tools — they mean frameworks are tools with a specific useful range, and that range is narrower than the demos suggest.

Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

· 11 min read
Tian Pan
Software Engineer

The highest-leverage prompt in your agent is not in your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference over a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.

The Annotation Pipeline Is Production Infrastructure

· 11 min read
Tian Pan
Software Engineer

Most teams treat their annotation pipeline the same way they treat their CI script from 2019: it works, mostly, and nobody wants to touch it. A shared spreadsheet with color-coded rows. A Google Form routing tasks to a Slack channel. Three contractors working asynchronously, comparing notes in a thread.

Then a model ships with degraded quality, an eval regresses in a confusing direction, and the post-mortem eventually surfaces the obvious: the labels were wrong, and no one built anything to detect it.

Annotation is not a data problem. It is a software engineering problem. The teams that treat it that way — with queues, schemas, monitoring, and structured disagreement handling — build AI products that improve over time. The teams that don't are in a cycle of re-labeling they can't quite explain.

The Context Window Cliff: What Actually Happens When Your Agent Hits the Limit Mid-Task

· 9 min read
Tian Pan
Software Engineer

Your agent completes steps one through six flawlessly. Step seven contradicts step two. Step eight hallucinates a tool that doesn't exist. Step nine confidently submits garbage. Nothing crashed. No error was thrown. The agent simply forgot what it was doing — and kept going anyway.

This is the context window cliff: the moment an AI agent's accumulated context exceeds its effective reasoning capacity. It doesn't fail gracefully. It doesn't ask for help. It makes confidently wrong decisions based on partial information, and you won't know until the damage is done.

The Alignment Tax: When Safety Tuning Hurts Your Production LLM

· 10 min read
Tian Pan
Software Engineer

You fine-tuned your model for safety. Your eval suite shows it refuses harmful requests 98% of the time. Then you deploy it to production — and your medical documentation assistant starts hedging on routine clinical terminology, your legal research tool refuses to summarize case law involving violence, and your code generation pipeline wraps every shell command in three layers of warnings. Completion rate drops 15%. User satisfaction craters. The model is safer and less useful.

This is the alignment tax: the measurable degradation in task performance that safety training imposes on language models. Every team shipping LLM-powered products pays it, but most never quantify it — and fewer still know how to reduce it without compromising the safety properties they need.

The Abstraction Inversion Problem: When AI Frameworks Force You to Think at the Wrong Level

· 9 min read
Tian Pan
Software Engineer

There is a specific moment in every AI agent project where the framework stops helping. You know it when you find yourself spending more time reading the framework's source code than writing your own features — reverse-engineering abstractions that were supposed to save you from complexity but instead became the primary source of it.

This is the abstraction inversion problem: when a framework forces you to reconstruct low-level primitives on top of high-level abstractions that were designed to hide them. The term comes from computer science — it describes what happens when the abstraction layer lacks the escape hatches you need, so you end up building the underlying capability back on top of it, at greater cost and with worse ergonomics than if you had started without the abstraction at all.

In AI engineering, this problem has reached epidemic proportions. Teams adopt orchestration frameworks expecting to move faster, hit a wall within weeks, and then spend months working around the very tool that was supposed to accelerate them.

The N+1 Query Problem Has Infected Your AI Agent

· 10 min read
Tian Pan
Software Engineer

Your AI agent just made twelve API calls to answer a question that needed two. You didn't notice because there's no EXPLAIN ANALYZE for tool calls, no ORM profiler flagging the issue, and the agent got the right answer anyway — just two seconds late and three times over-budget on tokens.

This is the N+1 query problem, and it has quietly migrated from your database layer into your agent's tool call layer. The bad news: the failure mode is identical to what poisoned web applications in the 2010s. The good news: the solutions from that era port almost directly.

The Unit Economics of AI Agents: When Does Autonomous Work Actually Save Money

· 10 min read
Tian Pan
Software Engineer

Your AI agent costs less than you think in development and far more than you think in production. The API bill — the number most teams optimize against — represents roughly 10–20% of the true total cost of running agents in production. The rest is buried in layers that most engineering budgets never explicitly model.

This matters because the decision to ship an agent at scale isn't really a technical decision. It's a unit economics decision. And the teams making that call with incomplete cost models are the same ones reporting negative ROI six months later.