21 posts tagged with "llm-engineering"

The Alignment Tax: When Safety Tuning Hurts Your Production LLM

· 10 min read
Tian Pan
Software Engineer

You fine-tuned your model for safety. Your eval suite shows it refuses harmful requests 98% of the time. Then you deploy it to production — and your medical documentation assistant starts hedging on routine clinical terminology, your legal research tool refuses to summarize case law involving violence, and your code generation pipeline wraps every shell command in three layers of warnings. Completion rate drops 15%. User satisfaction craters. The model is safer and less useful.

This is the alignment tax: the measurable degradation in task performance that safety training imposes on language models. Every team shipping LLM-powered products pays it, but most never quantify it — and fewer still know how to reduce it without compromising the safety properties they need.

The Abstraction Inversion Problem: When AI Frameworks Force You to Think at the Wrong Level

· 9 min read
Tian Pan
Software Engineer

There is a specific moment in every AI agent project where the framework stops helping. You know it when you find yourself spending more time reading the framework's source code than writing your own features — reverse-engineering abstractions that were supposed to save you from complexity but instead became the primary source of it.

This is the abstraction inversion problem: when a framework forces you to reconstruct low-level primitives on top of high-level abstractions that were designed to hide them. The term comes from computer science — it describes what happens when the abstraction layer lacks the escape hatches you need, so you end up building the underlying capability back on top of it, at greater cost and with worse ergonomics than if you had started without the abstraction at all.

In AI engineering, this problem has reached epidemic proportions. Teams adopt orchestration frameworks expecting to move faster, hit a wall within weeks, and then spend months working around the very tool that was supposed to accelerate them.

The N+1 Query Problem Has Infected Your AI Agent

· 10 min read
Tian Pan
Software Engineer

Your AI agent just made twelve API calls to answer a question that needed two. You didn't notice because there's no EXPLAIN ANALYZE for tool calls, no ORM profiler flagging the issue, and the agent got the right answer anyway, just two seconds late and three times over budget on tokens.

This is the N+1 query problem, and it has quietly migrated from your database layer into your agent's tool call layer. The bad news: the failure mode is identical to what poisoned web applications in the 2010s. The good news: the solutions from that era port almost directly.
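As a rough illustration of the pattern described above, the sketch below counts round trips for a per-item tool loop versus a single batched call. Everything here is hypothetical: the `CountingTools` class, its `get_order` and `get_orders` methods, and the in-memory `ORDERS` data are stand-ins for whatever tools your agent actually exposes.

```python
from typing import Dict, List

# Hypothetical in-memory backend standing in for a real API (assumption).
ORDERS: Dict[int, dict] = {i: {"id": i, "total": 10 * i} for i in range(1, 12)}


class CountingTools:
    """Hypothetical tool layer that counts how many round trips were made."""

    def __init__(self) -> None:
        self.calls = 0

    def list_order_ids(self) -> List[int]:
        self.calls += 1
        return sorted(ORDERS)

    def get_order(self, order_id: int) -> dict:
        # One round trip per order: the "N" in N+1.
        self.calls += 1
        return ORDERS[order_id]

    def get_orders(self, order_ids: List[int]) -> List[dict]:
        # One round trip for the whole batch.
        self.calls += 1
        return [ORDERS[i] for i in order_ids]


def total_n_plus_one(tools: CountingTools) -> int:
    """List the IDs, then fetch each order separately (1 + N calls)."""
    ids = tools.list_order_ids()
    return sum(tools.get_order(i)["total"] for i in ids)


def total_batched(tools: CountingTools) -> int:
    """List the IDs, then fetch all orders in one call (2 calls)."""
    ids = tools.list_order_ids()
    return sum(order["total"] for order in tools.get_orders(ids))


if __name__ == "__main__":
    naive, batched = CountingTools(), CountingTools()
    print(total_n_plus_one(naive), naive.calls)
    print(total_batched(batched), batched.calls)
```

Both functions return the same total; only the number of tool calls differs. That is why the problem can go unnoticed when the answer is correct.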

The Unit Economics of AI Agents: When Does Autonomous Work Actually Save Money

· 10 min read
Tian Pan
Software Engineer

Your AI agent costs less than you think in development and far more than you think in production. The API bill — the number most teams optimize against — represents roughly 10–20% of the true total cost of running agents in production. The rest is buried in layers that most engineering budgets never explicitly model.

This matters because the decision to ship an agent at scale isn't really a technical decision. It's a unit economics decision. And the teams making that call with incomplete cost models are the same ones reporting negative ROI six months later.
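To make the 10–20% figure concrete, here is a minimal back-of-the-envelope sketch. It assumes only the ratio stated above; the function names and the $1,000 example bill are hypothetical, and the result is a rough implied range rather than a cost model.

```python
from typing import Tuple


def implied_total_cost(api_bill: float, api_share: float) -> float:
    """Total cost implied if the API bill is `api_share` of the total."""
    if not 0 < api_share <= 1:
        raise ValueError("api_share must be in (0, 1]")
    return api_bill / api_share


def implied_range(
    api_bill: float, low_share: float = 0.10, high_share: float = 0.20
) -> Tuple[float, float]:
    """Return (lowest, highest) implied total cost for the share range."""
    # A larger API share means a smaller implied total, and vice versa.
    return (
        implied_total_cost(api_bill, high_share),
        implied_total_cost(api_bill, low_share),
    )


if __name__ == "__main__":
    # Hypothetical monthly API bill of $1,000.
    low, high = implied_range(1000.0)
    print(f"Implied total cost: ${low:,.0f} to ${high:,.0f} per month")
```

Under that assumption, a team budgeting against the API bill alone would be looking at somewhere between a fifth and a tenth of what it actually spends.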

Synthetic Training Data Quality Collapse: How Feedback Loops Destroy Your Fine-Tuned Models

· 10 min read
Tian Pan
Software Engineer

You generate 50,000 synthetic instruction-following examples with GPT-4, fine-tune a smaller model on them, deploy it, and the results look great. Six months later, your team repeats the process — except this time you generate the examples with the fine-tuned model to save costs. The second model's evals are slightly lower, but within noise. You tune the next version the same way. By the fourth iteration, your model's outputs have a strange homogeneity. Users report it sounds robotic. It struggles with anything that doesn't fit a narrow template. Your most capable fine-tune has become your worst.

This is model collapse — the progressive, self-reinforcing degradation that happens when LLMs train on data generated by other LLMs. It is not a theoretical risk. It is a documented failure mode with measurable mechanics, and it is increasingly likely to affect teams that have normalized synthetic data generation without thinking carefully about the feedback dynamics.
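The feedback dynamic can be sketched with a deliberately simplified toy model. It assumes that each generation slightly over-represents the outputs its predecessor found most likely, modeled here as raising probabilities to a power above 1 and renormalizing. The starting distribution, the `sharpen` exponent, and the function names are all illustrative assumptions, not measurements of any real model.

```python
import math
from typing import List


def entropy(p: List[float]) -> float:
    """Shannon entropy in bits; lower means more homogeneous output."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def next_generation(p: List[float], sharpen: float = 1.5) -> List[float]:
    """Toy 'retrain on own outputs' step: likely outputs get likelier."""
    weights = [x ** sharpen for x in p]
    total = sum(weights)
    return [w / total for w in weights]


def simulate(p: List[float], generations: int = 4) -> List[float]:
    """Return the entropy of the initial distribution and each successor."""
    history = [entropy(p)]
    for _ in range(generations):
        p = next_generation(p)
        history.append(entropy(p))
    return history


if __name__ == "__main__":
    # Hypothetical distribution over four response "styles".
    for gen, h in enumerate(simulate([0.4, 0.3, 0.2, 0.1])):
        print(f"generation {gen}: entropy = {h:.3f} bits")
```

In this toy setting diversity shrinks at every step even though no single step looks dramatic, which matches the "within noise" drift described above.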

Writing Tools for Agents: The ACI Is as Important as the API

· 9 min read
Tian Pan
Software Engineer

Most engineers approach agent tools the same way they approach writing a REST endpoint or a library function: expose the capability cleanly, document the parameters, handle errors. That's the right instinct for humans. For AI agents, it's exactly wrong.

A tool used by an agent is consumed non-deterministically, parsed token by token, and selected by a model that has no persistent memory of which tool it used last Tuesday. The tool schema you write is not documentation — it is a runtime prompt, injected into the model's context at inference time, shaping every decision the agent makes. Every field name, every description, every return value shape is a design decision with measurable performance consequences. This is the agent-computer interface (ACI), and it deserves the same engineering investment you'd put into any critical user-facing interface.
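As an illustration of the schema acting as a runtime prompt, the sketch below contrasts a terse tool definition with a more descriptive one and shows the text the model would actually read. The JSON-Schema-style layout resembles what many function-calling interfaces accept, but the exact field names, both example tools, and the `render_for_context` helper are assumptions made for illustration.

```python
import json
from typing import Any, Dict

# Hypothetical terse definition: the model must guess what "q" means.
VAGUE_TOOL: Dict[str, Any] = {
    "name": "search",
    "description": "Search.",
    "parameters": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
        "required": ["q"],
    },
}

# Hypothetical descriptive definition: names and descriptions carry intent.
DESCRIPTIVE_TOOL: Dict[str, Any] = {
    "name": "search_support_tickets",
    "description": (
        "Full-text search over customer support tickets. "
        "Use for questions about past customer issues. "
        "Returns at most 10 tickets, newest first."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Keywords to match, e.g. 'refund delayed'.",
            }
        },
        "required": ["query"],
    },
}


def render_for_context(tool: Dict[str, Any]) -> str:
    """Serialize the tool definition as it might appear in model context."""
    return json.dumps(tool, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_for_context(VAGUE_TOOL))
    print(render_for_context(DESCRIPTIVE_TOOL))
```

Every string in the second definition is text the model conditions on when deciding whether and how to call the tool, which is the sense in which the schema is a prompt.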

Why Your AI Agent Should Write Code Instead of Calling Tools

· 11 min read
Tian Pan
Software Engineer

Most AI agents are expensive because of a subtle architectural mistake: they treat every intermediate result as a message to be fed back into the model. Each tool call becomes a round trip through the LLM's context window, and by the time a moderately complex task completes, you've paid to process the same data five, ten, maybe twenty times. A single 2-hour sales transcript passed between three analysis tools might cost you 50,000 tokens — not for the analysis, just for the routing.

There's a better way. When agents write and execute code rather than calling tools one at a time, intermediate results stay in the execution environment, not the context window. The model sees summaries and filtered outputs, not raw data. The difference isn't incremental — it's been measured at 98–99% token reductions on real workloads.
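The accounting behind that claim can be sketched with a simplified token model. It assumes that each sequential tool call carries the full payload through the context window, while a code-executing agent writes one script and reads back only short summaries. The function names and the example numbers (a 16,000-token transcript, 200-token summaries, a 400-token script) are hypothetical; real savings depend on the workload.

```python
def tokens_via_tool_calls(
    payload_tokens: int, num_tools: int, summary_tokens: int
) -> int:
    """Each tool call routes the full payload plus its result via the model."""
    return num_tools * (payload_tokens + summary_tokens)


def tokens_via_code(
    num_tools: int, summary_tokens: int, script_tokens: int
) -> int:
    """The model writes one script and sees only each step's summary."""
    return script_tokens + num_tools * summary_tokens


def reduction(before: int, after: int) -> float:
    """Fractional token reduction from `before` to `after`."""
    return 1 - after / before


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the arithmetic.
    routed = tokens_via_tool_calls(
        payload_tokens=16_000, num_tools=3, summary_tokens=200
    )
    coded = tokens_via_code(num_tools=3, summary_tokens=200, script_tokens=400)
    print(routed, coded, f"{reduction(routed, coded):.1%}")
```

In this model the payload size drops out of the code-execution cost entirely, so the gap widens as the intermediate data grows.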

How AI Agents Actually Learn Over Time

· 8 min read
Tian Pan
Software Engineer

Most teams building AI agents treat the model as a fixed artifact. You pick a foundation model, write your prompts, wire up some tools, and ship. If the agent starts making mistakes, you tweak the system prompt or switch to a newer model. Learning, in this framing, happens upstream—at the AI lab, during pretraining and RLHF—not in your stack.

This is the wrong mental model. Agents that improve over time do so at three distinct architectural layers, and only one of them involves touching model weights. Teams that understand this distinction build systems that compound in quality; teams that don't keep manually patching the same failure modes.

Measuring AI Agent Autonomy in Production: What the Data Actually Shows

· 7 min read
Tian Pan
Software Engineer

Most teams building AI agents spend weeks on pre-deployment evals and almost nothing on measuring what their agents actually do in production. That's backwards. The metrics that matter—how long agents run unsupervised, how often they ask for help, how much risk they take on—only emerge at runtime, across thousands of real sessions. Without measuring these, you're flying blind.

A large-scale study of production agent behavior across thousands of deployments and software engineering sessions has surfaced some genuinely counterintuitive findings. The picture that emerges is not the one most builders expect.