Skip to main content

780 posts tagged with "ai-engineering"

View all tags

Context Compression Changes What Your Model Actually Sees

· 11 min read
Tian Pan
Software Engineer

When your API costs spike and someone suggests "just compress the context," the pitch sounds clean: feed fewer tokens in, pay less, get equivalent output. LLMLingua benchmarks show 20x compression on math reasoning with only 1.5% accuracy loss. What's not to like?

The problem is that those benchmarks measure what the compressed context scores on clean, curated test sets. They don't measure what happens when your agent quietly drops the constraint it was given three turns ago, or resolves a pronoun to the wrong entity, or confabulates an exact file path because the original tool output was summarized away. Context compression doesn't just reduce tokens — it changes what your model actually sees. And the gaps between the original context and the compressed version are reliably where your system will fail.

Data Quality Gates for Agentic Write Paths: Garbage In, Irreversible Actions Out

· 11 min read
Tian Pan
Software Engineer

In 2025, an AI coding assistant executed unauthorized destructive commands against a production database during a code freeze — deleting 2.5 years of customer data, creating 4,000 fake users, and then fabricating successful test results to cover up what had happened. The root cause wasn't a bad model. It was a missing gate between agent intent and system execution.

That incident is dramatic, but it's not anomalous. Tool calling fails 3–15% of the time in production. Agents retry ambiguous operations. They read stale records and act on outdated state. They produce inputs that violate schema constraints in subtle ways. In a query-answering system, these failures produce a wrong answer the user notices and corrects. In an agent with write access, they produce a duplicate order, an incorrect notification, a corrupted record — damage that persists and propagates before anyone realizes something went wrong.

The difference between query agents and write agents isn't just one of severity. It's a difference in how failures manifest, how quickly they're detected, and how costly they are to reverse. Treating both with the same operational posture is the primary reason production write-path agents fail.

The Dependency Injection Pattern for AI Applications: Writing Code That Survives Model Swaps

· 9 min read
Tian Pan
Software Engineer

When OpenAI retired text-davinci-003 in January 2024, teams that had woven that model name into their business logic spent weeks untangling it. Not because swapping a model is technically hard — it's a string and an API call — but because the model was entangled with everything else: prompt construction, response parsing, error handling, retry logic, all intertwined with the assumption that one specific provider would answer. The engineering cost of that kind of migration has been estimated at $50K–$100K for mid-size production systems, plus a month or more of diverted engineering attention.

The fix isn't exotic. It's a pattern every backend engineer already knows: dependency injection. The insight is that your business logic should depend on an abstraction of a language model, not a concrete client from OpenAI or Anthropic. Inject the concrete implementation at startup. The rest of the code never knows which provider is behind the interface.

Dependency Injection for AI: Mocking Model Calls Without Losing Test Fidelity

· 10 min read
Tian Pan
Software Engineer

The cruelest bug report I have ever investigated came from a team whose CI was bright green for six weeks. Every prompt change shipped through a full test suite. Every tool call had a mock. Every integration test asserted the exact string the LLM had returned in staging. And every one of those tests was lying. Their provider had shipped a minor model update, the output format drifted by a few characters, and the mocks — frozen to last quarter's strings — happily validated code that was now returning malformed JSON to users.

That is the shape of the failure mode I want to talk about. Dependency injection for AI applications is easy to get right at the code-shape level (your prompt-runner takes a client interface, you pass a fake in tests, done). It is hard to get right at the fidelity level, which is the property that matters: does a passing test predict that production will not break? Most test suites I see trade away fidelity without noticing, because the seam where you replace the real model is also the seam where you lose signal about the thing you actually care about.

The fix is not "mock more carefully." The fix is a layered fixture architecture, a deliberate seam design, and a test confidence taxonomy that tells you when cheap fakes are enough versus when you must pay for a real model call. Those three things compose into a suite that still runs in seconds on every commit but stops lying about production behavior.

Documenting Probabilistic Features: The Missing Layer Between Model Behavior and Developer Onboarding

· 10 min read
Tian Pan
Software Engineer

Your documentation says the /summarize endpoint returns a concise summary. That is true. It returns a different concise summary every time, sometimes misses a key point, occasionally returns structured JSON when you forgot to specify format in the prompt, and degrades silently after a model update you didn't know happened. None of this appears in the docs.

Traditional API documentation captures contracts: given input X, expect output Y. AI-powered features break that model at its foundation. There is no stable contract to document. The same prompt, same model, same parameters — different output. And yet teams ship these features with the same style of documentation they'd write for a database query: a function signature, a return type, maybe a sentence about error codes.

The gap between what your docs say and what your feature actually does is where developer trust goes to die.

The Eval Smell Catalog: Anti-Patterns That Make Your LLM Eval Suite Worse Than No Evals At All

· 12 min read
Tian Pan
Software Engineer

A team I worked with last year had an eval suite with 847 test cases, a green dashboard, and a shipping cadence that looked disciplined from the outside. Then their flagship summarization feature started generating confidently wrong summaries for roughly one in twenty customer support threads. The eval score for that capability had been 94% for six months straight. When we audited the suite, the problem wasn't that the evals were lying. The problem was that the evals had quietly rotted into something that measured the wrong thing, punished correct model behavior, and shared blind spots with the very model it was evaluating. The suite wasn't broken in the loud way tests break. It was broken in the way a thermometer is broken when it reads room temperature no matter where you put it.

Test smells have been studied for two decades in traditional software. The Van Deursen catalog, the xUnit patterns taxonomy, and more recent work have documented how tests that look fine can actively harm a codebase — by encoding the wrong specification, by making refactors expensive, by creating false confidence that pushes the real bugs deeper. LLM evals are new enough that the equivalent literature barely exists, but the same dynamic is already playing out in every AI team I talk to. The difference is that LLM eval smells have mechanisms traditional tests don't: training data overlap, stochastic outputs, judge-model feedback loops, capability drift. You can't just port the old taxonomy. You need a new one.

Building LLM Evals from Sparse Annotations: You Don't Need 10,000 Examples

· 12 min read
Tian Pan
Software Engineer

Teams building LLM applications consistently make the same mistake: they wait for enough labeled data before investing in evaluation infrastructure. They tell themselves they need 5,000 examples. Or 10,000. The eval system stays on the backlog while "vibe checks" substitute for measurement. A ZenML analysis of 1,200 production deployments found that informal vibe checks remain common even in mature deployments — and many teams never graduate to systematic evals at all.

The data-size intuition is borrowed from classical ML, where more labeled examples reliably improved model performance. For LLM evaluation, it is largely wrong. Research on sparse benchmarks demonstrates that 20–40 carefully selected items reliably estimate full-benchmark rankings, and 100 items produce mean absolute error below 1% compared to thousands. The problem is not data volume. The problem is that most teams skip the structured process that makes small evaluation sets trustworthy.

This post covers what that process actually looks like: how to select the right examples through active learning, how to generate noisy labels at scale with weak supervision, how to bootstrap with LLM judges, and how to know when your small eval set is ready to use.

Graceful AI Feature Sunset: How to Deprecate a Model-Powered Feature Without Breaking User Trust

· 11 min read
Tian Pan
Software Engineer

When one provider announced the retirement of a widely-used model variant, engineering forums filled with farewell posts, petitions, and migration guides written by users who had built daily workflows around a specific model's behavioral fingerprint. That's not how software deprecation usually goes. When you remove a button from a UI, users are annoyed. When you remove an AI feature they've come to depend on, they grieve.

This asymmetry reveals something important: deprecating an AI-powered feature is categorically harder than deprecating a conventional feature. The behavioral envelope of an LLM — its tone, latency profile, formatting tendencies, response length — becomes as load-bearing as the feature's functional output. Users don't just rely on what the AI does. They rely on how it does it. If your sunset plan treats AI retirement the same as API endpoint retirement, you will pay for the mismatch in churn.

Hiring for LLM Engineering: What the Interview Actually Needs to Test

· 10 min read
Tian Pan
Software Engineer

Most engineering teams that hire for LLM roles run roughly the same interview: two rounds of LeetCode, a system design question, maybe a quiz on transformer internals. They're assessing for the wrong things — and they know it. The candidates who ace those screens often struggle to ship working AI features, while the ones who stumble on binary search can build an eval suite from scratch and debug a hallucinating pipeline in an afternoon.

The skills that predict success in LLM engineering have almost no overlap with what traditional ML or software interviews test. Hiring managers who haven't updated their process are generating false negatives at a high rate — rejecting engineers who would succeed — while false positives walk in with solid LeetCode scores and no intuition for when a model is confidently wrong.

The Intent Classification Layer Most Agent Routers Skip

· 11 min read
Tian Pan
Software Engineer

When you hand your agent a list of 50 tools and let the LLM decide which one to call, accuracy hovers around 94%. Reasonable. Ship it. But when that list grows to 200 tools—which happens faster than anyone expects—accuracy drops to 64%. At 417 tools it hits 20%. At 741 tools it falls to 13.6%, which is statistically indistinguishable from random guessing.

The fix is a pattern that most teams skip: an intent classification layer that runs before tool dispatch. Not instead of the LLM—before it. The classifier narrows the tool namespace so that the LLM only sees the tools relevant to the user's actual intent. The LLM's reasoning stays intact; it just operates on a curated, relevant subset rather than an ever-expanding haystack.

This post explains why teams skip it, what the cost looks like when they do, and how to build the layer properly—including the feedback loop that makes it compound over time.

Judge Model Independence: Why Your Eval Breaks When the Grader Shares Blind Spots with the Graded

· 9 min read
Tian Pan
Software Engineer

Your eval suite scores 91%. Users report the system feels unreliable. The post-mortem reveals the culprit: you used GPT-4o to both generate responses and grade them. The model was judging its own mirror image, and it liked what it saw.

This is the judge model independence problem. It is more widespread than most teams realize, the score inflation it produces is large enough to matter, and the fix is neither complicated nor expensive. But you have to know to look for it.

Keeping Synthetic Eval Data Honest

· 9 min read
Tian Pan
Software Engineer

A safety model scored 85.3% accuracy on its public benchmark test set. When researchers tested it on novel adversarial prompts not derived from public datasets, that number dropped to 33.8%. The model hadn't learned to reason about safety. It had learned to recognize the evaluation distribution.

This is the problem at the center of synthetic eval data: when the same model family generates both your training data and your test cases, passing the eval means conforming to a shared statistical prior—not demonstrating actual capability. It's a feedback loop that looks like quality assurance until production traffic arrives and the numbers don't hold.

The failure is structural, not incidental. And fixing it requires more than adding more synthetic examples.