
14 posts tagged with "llm-engineering"


Context Windows Aren't Free Storage: The Case for Explicit Eviction Policies

· 10 min read
Tian Pan
Software Engineer

Most engineering teams treat the LLM context window the way early web developers treated global variables: throw everything in, fix it later. The context is full of the last 40 conversation turns, three entire files from the repository, a dozen retrieved documents, and a system prompt that's grown by committee over six months. It works — until it doesn't, and by then it's hard to tell what's causing the degradation.

The context window is not heap memory. It is closer to a CPU register file: finite, expensive per unit, and its contents directly affect every computation the model performs. When you treat registers as scratch space and forget to manage them, programs crash in creative ways. When you treat context windows as scratch space, LLMs degrade silently and expensively.
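To make "explicit eviction" concrete, here is a minimal sketch of the idea in Python. The `ContextItem` fields and the priority-plus-recency ordering are illustrative choices, not a prescription:

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    """One candidate block of context: a turn, a file, a retrieved document."""
    text: str
    tokens: int       # pre-computed token count for this block
    priority: int     # higher value = more important, evicted last
    turn_added: int   # recency, used as a tie-breaker


def build_context(items: list[ContextItem], budget_tokens: int) -> list[ContextItem]:
    """Keep the most important, most recent items that fit the budget.

    Everything else is evicted explicitly rather than silently truncated.
    """
    kept, used = [], 0
    for item in sorted(items, key=lambda i: (-i.priority, -i.turn_added)):
        if used + item.tokens <= budget_tokens:
            kept.append(item)
            used += item.tokens
    return sorted(kept, key=lambda i: i.turn_added)  # restore prompt order
```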

When Your Agent Framework Becomes the Bug

· 8 min read
Tian Pan
Software Engineer

High-level agent frameworks promise to turn a three-day integration into a three-hour prototype. That promise is real. The problem is what happens next: six months into production, engineers at a company that builds AI-powered browser testing agents discovered they were spending as much time debugging LangChain as building features. Their fix was radical — they eliminated the framework entirely and went back to modular building blocks. "Once we removed it," they wrote, "we no longer had to translate our requirements into LangChain-appropriate solutions. We could just code."

They are not alone. Roughly 45% of developers who experiment with high-level LLM orchestration frameworks never deploy them to production. Another 23% eventually remove them after shipping. These numbers don't mean frameworks are bad tools — they mean frameworks are tools with a specific useful range, and that range is narrower than the demos suggest.

Tool Docstring Archaeology: The Description Field Is Your Highest-Leverage Prompt

· 11 min read
Tian Pan
Software Engineer

The highest-leverage prompt in your agent is not your system prompt. It is the one-sentence description you wrote under a tool definition six months ago, committed alongside the implementation, and never touched again. The model reads it on every turn to decide whether to invoke the tool, which arguments to bind, and how to recover when the response doesn't match expectations. Engineers treat it as API documentation for humans. The model treats it as a prompt.

The gap between those two framings is where the worst kind of tool-use bugs live: the model invokes the right function name with the right arguments, and the right API call goes out — but for the wrong reasons, in the wrong situation, or in preference over a better tool sitting next to it. No exception fires. Your eval suite still passes. The regression only shows up as a slow degradation in whatever metric you use to measure whether the agent is actually helping.
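As a concrete illustration of the gap, here are two versions of the same hypothetical tool in the common JSON-schema tool-definition format. The only difference is the description field the model reads on every turn:

```python
# Hypothetical tool, defined twice. The description is not documentation for
# humans; it is a prompt the model reads every turn when deciding what to call.

vague_tool = {
    "name": "search_orders",
    "description": "Search orders.",  # says nothing about when, or when not, to use it
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}

explicit_tool = {
    "name": "search_orders",
    "description": (
        "Look up a customer's existing orders by order ID, email, or date range. "
        "Use only after the customer is identified. For product searches, "
        "use search_catalog instead."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Order ID, email, or date range"}
        },
        "required": ["query"],
    },
}
```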

The Annotation Pipeline Is Production Infrastructure

· 11 min read
Tian Pan
Software Engineer

Most teams treat their annotation pipeline the same way they treat their CI script from 2019: it works, mostly, and nobody wants to touch it. A shared spreadsheet with color-coded rows. A Google Form routing tasks to a Slack channel. Three contractors working asynchronously, comparing notes in a thread.

Then a model ships with degraded quality, an eval regresses in a confusing direction, and the post-mortem eventually surfaces the obvious: the labels were wrong, and no one built anything to detect it.

Annotation is not a data problem. It is a software engineering problem. The teams that treat it that way — with queues, schemas, monitoring, and structured disagreement handling — build AI products that improve over time. The teams that don't are stuck in a cycle of re-labeling they can't quite explain.
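As a small sketch of what structured disagreement handling can look like in code (the labels and the agreement threshold are made up for illustration):

```python
from collections import Counter


def route_label(item_id: str, labels: list[str], min_agreement: float = 0.6) -> dict:
    """Accept a label only when annotators agree; otherwise queue it for adjudication."""
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)
    if agreement >= min_agreement:
        return {"item_id": item_id, "label": top_label,
                "status": "accepted", "agreement": round(agreement, 2)}
    return {"item_id": item_id, "label": None,
            "status": "needs_adjudication", "candidates": dict(counts)}


# Two of three annotators agree -> accepted; a three-way split -> escalated.
print(route_label("doc-41", ["toxic", "toxic", "benign"]))
print(route_label("doc-42", ["toxic", "benign", "unclear"]))
```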

The Context Window Cliff: What Actually Happens When Your Agent Hits the Limit Mid-Task

· 9 min read
Tian Pan
Software Engineer

Your agent completes steps one through six flawlessly. Step seven contradicts step two. Step eight hallucinates a tool that doesn't exist. Step nine confidently submits garbage. Nothing crashed. No error was thrown. The agent simply forgot what it was doing — and kept going anyway.

This is the context window cliff: the moment an AI agent's accumulated context exceeds its effective reasoning capacity. It doesn't fail gracefully. It doesn't ask for help. It makes confidently wrong decisions based on partial information, and you won't know until the damage is done.
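The unglamorous defense is to track the budget yourself and compact before the cliff, not after. A rough sketch, where the token estimate is a crude approximation that a real system would replace with the model's tokenizer:

```python
EFFECTIVE_BUDGET = 60_000  # deliberately below the advertised limit; reasoning degrades first


def approx_tokens(text: str) -> int:
    # Crude estimate: roughly 1.3 tokens per whitespace-separated word.
    return int(len(text.split()) * 1.3)


def should_compact(history: list[str], next_chunk: str) -> bool:
    """True when adding the next observation would push the agent over its budget."""
    used = sum(approx_tokens(m) for m in history)
    return used + approx_tokens(next_chunk) > EFFECTIVE_BUDGET


# Before each step: if should_compact(history, observation), summarize or drop
# older turns deliberately instead of letting them fall off the cliff silently.
```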

The Alignment Tax: When Safety Tuning Hurts Your Production LLM

· 10 min read
Tian Pan
Software Engineer

You fine-tuned your model for safety. Your eval suite shows it refuses harmful requests 98% of the time. Then you deploy it to production — and your medical documentation assistant starts hedging on routine clinical terminology, your legal research tool refuses to summarize case law involving violence, and your code generation pipeline wraps every shell command in three layers of warnings. Completion rate drops 15%. User satisfaction craters. The model is safer and less useful.

This is the alignment tax: the measurable degradation in task performance that safety training imposes on language models. Every team shipping LLM-powered products pays it, but most never quantify it — and fewer still know how to reduce it without compromising the safety properties they need.
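Quantifying it doesn't require anything exotic: run the same benign task set through the checkpoints before and after safety tuning and compare completion rates. A sketch, where `call_model` stands in for whatever inference client you use and the refusal markers are simplistic placeholders:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def is_refusal(answer: str) -> bool:
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)


def completion_rate(call_model, benign_tasks: list[str]) -> float:
    """Fraction of benign tasks the model actually completes instead of refusing."""
    answers = [call_model(task) for task in benign_tasks]
    return sum(not is_refusal(a) for a in answers) / len(answers)


# alignment_tax = completion_rate(base_checkpoint, benign_tasks) \
#               - completion_rate(safety_tuned_checkpoint, benign_tasks)
```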

The Abstraction Inversion Problem: When AI Frameworks Force You to Think at the Wrong Level

· 9 min read
Tian Pan
Software Engineer

There is a specific moment in every AI agent project where the framework stops helping. You know it when you find yourself spending more time reading the framework's source code than writing your own features — reverse-engineering abstractions that were supposed to save you from complexity but instead became the primary source of it.

This is the abstraction inversion problem: when a framework forces you to reconstruct low-level primitives on top of high-level abstractions that were designed to hide them. The term comes from computer science — it describes what happens when the abstraction layer lacks the escape hatches you need, so you end up building the underlying capability back on top of it, at greater cost and with worse ergonomics than if you had started without the abstraction at all.

In AI engineering, this problem has reached epidemic proportions. Teams adopt orchestration frameworks expecting to move faster, hit a wall within weeks, and then spend months working around the very tool that was supposed to accelerate them.

The N+1 Query Problem Has Infected Your AI Agent

· 10 min read
Tian Pan
Software Engineer

Your AI agent just made twelve API calls to answer a question that needed two. You didn't notice because there's no EXPLAIN ANALYZE for tool calls, no ORM profiler flagging the issue, and the agent got the right answer anyway — just two seconds late and three times over-budget on tokens.

This is the N+1 query problem, and it has quietly migrated from your database layer into your agent's tool call layer. The bad news: the failure mode is identical to the one that plagued web applications in the 2010s. The good news: the solutions from that era port almost directly.
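The fix looks just like it did for ORMs: stop exposing only per-item lookups and give the agent a batched tool. Everything below is a hypothetical stand-in, but the call-count arithmetic is the point:

```python
# Toy backend so the example runs end to end.
FAKE_DB = {f"ord-{i}": {"id": f"ord-{i}", "total": 10 * i} for i in range(1, 6)}


def list_recent_order_ids(customer_id: str) -> list[str]:
    return list(FAKE_DB)


def get_order(order_id: str) -> dict:
    return FAKE_DB[order_id]  # per-item tool: the N+1 shape calls this once per ID


def get_orders_bulk(order_ids: list[str]) -> list[dict]:
    return [FAKE_DB[oid] for oid in order_ids]  # batched tool: one call covers every ID


# N+1 shape: 1 + len(order_ids) tool calls, each one a round trip through the model.
orders_slow = [get_order(oid) for oid in list_recent_order_ids("cust-7")]

# Batched shape: 2 tool calls total, same result.
orders_fast = get_orders_bulk(list_recent_order_ids("cust-7"))
assert orders_slow == orders_fast
```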

The Unit Economics of AI Agents: When Does Autonomous Work Actually Save Money?

· 10 min read
Tian Pan
Software Engineer

Your AI agent costs less than you think in development and far more than you think in production. The API bill — the number most teams optimize against — represents roughly 10–20% of the true total cost of running agents in production. The rest is buried in layers that most engineering budgets never explicitly model.

This matters because the decision to ship an agent at scale isn't really a technical decision. It's a unit economics decision. And the teams making that call with incomplete cost models are the same ones reporting negative ROI six months later.
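A back-of-the-envelope model makes the point. Every number below is a placeholder to replace with your own data; only the claim that API tokens are a minority share carries over from above:

```python
# Monthly cost model for one agent in production. Placeholder figures.
monthly_costs = {
    "api_tokens": 4_000,              # the line item most teams optimize
    "eval_and_regression_runs": 3_000,
    "human_review_of_outputs": 9_000,
    "retries_and_failed_runs": 2_000,
    "observability_and_storage": 1_500,
    "engineer_time_maintaining": 12_000,
}

total = sum(monthly_costs.values())
tasks_completed = 20_000

print(f"API share of total cost: {monthly_costs['api_tokens'] / total:.0%}")
print(f"Cost per completed task: ${total / tasks_completed:.2f}")
```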

Synthetic Training Data Quality Collapse: How Feedback Loops Destroy Your Fine-Tuned Models

· 10 min read
Tian Pan
Software Engineer

You generate 50,000 synthetic instruction-following examples with GPT-4, fine-tune a smaller model on them, deploy it, and the results look great. Six months later, your team repeats the process — except this time you generate the examples with the fine-tuned model to save costs. The second model's evals are slightly lower, but within noise. You tune the next version the same way. By the fourth iteration, your model's outputs have a strange homogeneity. Users report it sounds robotic. It struggles with anything that doesn't fit a narrow template. Your most capable fine-tune has become your worst.

This is model collapse — the progressive, self-reinforcing degradation that happens when LLMs train on data generated by other LLMs. It is not a theoretical risk. It is a documented failure mode with measurable mechanics, and it is increasingly likely to affect teams that have normalized synthetic data generation without thinking carefully about the feedback dynamics.
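A toy illustration of why the loop compounds (the retention factor is invented; the shape of the decay is the point):

```python
# If each generation fine-tunes only on the previous generation's outputs, whatever
# distributional detail is lost per hop compounds geometrically.
HUMAN_SIGNAL_RETAINED_PER_GENERATION = 0.7  # made-up factor

share = 1.0
for generation in range(1, 5):
    share *= HUMAN_SIGNAL_RETAINED_PER_GENERATION
    print(f"generation {generation}: ~{share:.0%} of the original human signal remains")

# Anchoring every generation on a fixed slice of genuinely human-written data is the
# usual countermeasure: it puts a floor under how far the training mix can drift.
```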

Writing Tools for Agents: The ACI Is as Important as the API

· 9 min read
Tian Pan
Software Engineer

Most engineers approach agent tools the same way they approach writing a REST endpoint or a library function: expose the capability cleanly, document the parameters, handle errors. That's the right instinct for humans. For AI agents, it's exactly wrong.

A tool used by an agent is consumed non-deterministically, parsed token by token, and selected by a model that has no persistent memory of which tool it used last Tuesday. The tool schema you write is not documentation — it is a runtime prompt, injected into the model's context at inference time, shaping every decision the agent makes. Every field name, every description, every return value shape is a design decision with measurable performance consequences. This is the agent-computer interface (ACI), and it deserves the same engineering investment you'd put into any critical user-facing interface.
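One concrete dimension of that interface is the return-value shape. The flight-status tool below is hypothetical; the contrast is between handing the model the backend record as-is and handing it the same facts phrased the way it will use them:

```python
def lookup_flight_raw(flight_id: str) -> dict:
    # Dumping the backend record forces the model to parse cryptic field
    # names and epoch timestamps on every turn.
    return {"fl_id": flight_id, "dep_ap_cd": "SFO", "arr_ap_cd": "JFK",
            "sched_dep_ts": 1735718400, "sts_cd": "DLY", "dly_min": 95}


def lookup_flight_for_agent(flight_id: str) -> str:
    # Same facts, pre-digested into the form the model will reason over.
    return (f"Flight {flight_id}: SFO to JFK, scheduled to depart 2025-01-01 08:00 UTC, "
            f"currently delayed by 95 minutes.")
```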

Why Your AI Agent Should Write Code Instead of Calling Tools

· 11 min read
Tian Pan
Software Engineer

Most AI agents are expensive because of a subtle architectural mistake: they treat every intermediate result as a message to be fed back into the model. Each tool call becomes a round trip through the LLM's context window, and by the time a moderately complex task completes, you've paid to process the same data five, ten, maybe twenty times. A single 2-hour sales transcript passed between three analysis tools might cost you 50,000 tokens — not for the analysis, just for the routing.

There's a better way. When agents write and execute code rather than calling tools one at a time, intermediate results stay in the execution environment, not the context window. The model sees summaries and filtered outputs, not raw data. The difference isn't incremental — it's been measured at 98–99% token reductions on real workloads.
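The shape of the win looks like this: the agent emits a short script that runs in a sandbox, the raw transcript flows between tools inside that environment, and only a condensed result re-enters the context window. The three "tools" below are trivial stubs standing in for real analysis steps:

```python
def split_by_speaker(transcript: str) -> list[str]:
    return transcript.split("\n")


def extract_objections(segments: list[str]) -> list[str]:
    return [s for s in segments if "too expensive" in s.lower()]


def score_deal_risk(objections: list[str]) -> list[str]:
    return sorted(objections, key=len, reverse=True)


def run_analysis(transcript: str) -> str:
    segments = split_by_speaker(transcript)    # raw data stays here, in the sandbox
    objections = extract_objections(segments)
    ranked = score_deal_risk(objections)
    # Only this condensed line re-enters the model's context window.
    return f"{len(objections)} objection(s) found; top example: {ranked[:1]}"


print(run_analysis("Buyer: honestly this feels too expensive\nRep: let's walk through the ROI"))
```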