
426 posts tagged with "llm"


Synthetic Eval Bootstrapping: How to Build Ground-Truth Datasets When You Have No Labeled Data

· 10 min read
Tian Pan
Software Engineer

The common failure mode isn't building AI features that don't work. It's shipping AI features without any way to know whether they work. And the reason teams skip evaluation infrastructure isn't laziness — it's that building evals requires labeled data, and on day one you have none.

This is the cold start problem for evals. To get useful signal, you need your system running in production. To deploy with confidence, you need evaluation infrastructure first. The circular dependency is real, and it causes teams to do one of three things: ship without evals and discover failures in production, delay shipping while hand-labeling data for months, or use synthetic evals — with all the risks that entails.

This post is about the third path done correctly. Synthetic eval bootstrapping works, but only if you understand what it cannot detect and build around those blind spots from the start.
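As a preview of the approach, here's a minimal sketch of the bootstrap loop in Python. The `llm` and `pipeline` callables are placeholders for your own client and application code, and the substring scorer is deliberately crude:

```python
# A minimal sketch of eval bootstrapping. `llm` stands in for whatever
# provider client you use; everything here is illustrative.
import json

def generate_eval_cases(llm, document: str, n: int = 5) -> list[dict]:
    """Ask a strong model to write Q/A pairs answerable only from `document`.

    The key property: each pair cites its supporting passage, so a human
    can verify it in seconds. Review stays cheap even though authoring
    was nearly free.
    """
    prompt = (
        f"Write {n} question/answer pairs that can be answered ONLY from "
        "the document below. Quote the supporting passage for each pair.\n"
        'Return a JSON list: [{"question": ..., "answer": ..., "evidence": ...}]\n\n'
        f"Document:\n{document}"
    )
    return json.loads(llm(prompt))

def score(pipeline, cases: list[dict]) -> float:
    """Crude substring scorer -- fine for bootstrapping, replace it later."""
    hits = sum(
        case["answer"].lower() in pipeline(case["question"]).lower()
        for case in cases
    )
    return hits / len(cases)
```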

System Prompt Sprawl: When Your AI Instructions Become a Source of Bugs

· 9 min read
Tian Pan
Software Engineer

Most teams discover the system prompt sprawl problem the hard way. The AI feature launches, users find edge cases, and the fix is always the same: add another instruction. After six months you have a 4,000-token system prompt that nobody can fully hold in their head. The model starts doing things nobody intended — not because it's broken, but because the instructions you wrote contradict each other in subtle ways the model is quietly resolving on your behalf.

Sprawl isn't a catastrophic failure. That's what makes it dangerous. The model doesn't crash or throw an error when your instructions conflict. It makes a choice, usually fluently, usually plausibly, and usually in a way that's wrong just often enough to be a real support burden.

Temperature Governance in Multi-Agent Systems: Why Variance Is a First-Class Budget

· 11 min read
Tian Pan
Software Engineer

Most production multi-agent systems apply a single temperature value—copied from a tutorial, set once, never revisited—to every agent in the pipeline. The classifier, the generator, the verifier, and the formatter all run at 0.7 because that's what the README said. This is the equivalent of giving every database query the same timeout regardless of whether it's a point lookup or a full table scan. It feels fine until you start debugging failure modes that look like model errors but are actually sampling policy errors.

Temperature is not a global dial. It's a per-role policy decision, and getting it wrong creates distinct failure signatures depending on which direction you miss in.
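To make that concrete, here's one sketch of what a per-role policy can look like. The role names and values are illustrative, and the call assumes an OpenAI-style SDK:

```python
# Temperature as a per-role policy, not a global constant.
TEMPERATURE_POLICY = {
    "classifier": 0.0,  # routing must be reproducible; variance here means misroutes
    "generator":  0.7,  # the one stage where output diversity is a feature
    "verifier":   0.0,  # a verifier that samples creatively isn't verifying
    "formatter":  0.0,  # schema-bound output; any variance is a defect
}

def complete(client, role: str, model: str, messages: list[dict]):
    """Route every call through the policy so no agent picks its own dial."""
    # A KeyError on an unregistered role is deliberate: a new agent must be
    # given a variance budget before it can run.
    return client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=TEMPERATURE_POLICY[role],
    )
```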

Temporal Context Injection: Making LLMs Actually Know What Day It Is

· 11 min read
Tian Pan
Software Engineer

Your LLM-powered feature shipped. Users are asking it questions that involve time — "what's the latest policy?", "summarize what happened this week", "is this information current?" — and it answers confidently, fluently, and incorrectly.

The model doesn't know what day it is. It never did. The chat interface you're used to made that easy to forget, because those interfaces quietly inject the current date behind the scenes. Your API integration doesn't. You're shipping a system that reasons about time without knowing where it is in time — and that's a bug class that will show up in production before you think to look for it.
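The core fix is mechanical enough to show in a few lines. A sketch, with the wording of the injected header purely illustrative:

```python
# Stamp the current date into the system prompt on every request -- the
# same thing the chat UIs do quietly on your behalf.
from datetime import datetime, timezone

def with_temporal_context(system_prompt: str) -> str:
    now = datetime.now(timezone.utc)
    header = (
        f"Current date and time (UTC): {now.strftime('%A, %B %d, %Y, %H:%M')}.\n"
        "Information from after your training cutoff is unknown to you; "
        "say so rather than guessing.\n\n"
    )
    return header + system_prompt
```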

Text-to-SQL in Production: Why Natural Language Queries Fail at the Schema Boundary

· 9 min read
Tian Pan
Software Engineer

The demo works every time. The LLM translates "show me last quarter's top ten customers by revenue" into pristine SQL, the results pop up instantly, and everyone in the room nods. Then you deploy it against your actual warehouse — 130 tables, 1,400 columns, a decade of organic naming conventions — and the model starts confidently generating queries that return the wrong numbers. No errors. Just wrong answers.

This is the schema boundary problem, and it's why text-to-SQL has the widest gap of any AI capability between benchmark performance and production reality. A model that scores 86% on Spider 1.0 (the canonical academic benchmark) drops to around 6% accuracy on Spider 2.0, which approximates real enterprise schema complexity. Vendors demo on clean, toy schemas. You're deploying on yours.
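One common mitigation, often called schema linking, is to stop handing the model the full schema. The sketch below scores candidate tables by naive keyword overlap; production systems usually use embeddings, but the shape of the fix is the same:

```python
# Don't give the model 1,400 columns. Retrieve a small candidate subset
# of the schema first, then generate SQL against only that subset.
def relevant_tables(question: str, schema: dict[str, list[str]], k: int = 5):
    """schema maps table name -> list of column names."""
    words = set(question.lower().split())

    def overlap(table: str, columns: list[str]) -> int:
        names = [table] + columns
        return sum(any(w in name.lower() for name in names) for w in words)

    ranked = sorted(schema, key=lambda t: overlap(t, schema[t]), reverse=True)
    return {t: schema[t] for t in ranked[:k]}
```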

The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think

· 10 min read
Tian Pan
Software Engineer

Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.

The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.
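The gap is easy to reproduce on paper. The numbers below are made up, but the shape of the curve is the point:

```python
# Back-of-the-envelope math, done right. Naive estimate: calls * per-call
# cost. Actual: every turn re-sends the entire accumulated context.
SYSTEM = 1_500     # system prompt + tool definitions (tokens)
PER_TURN = 800     # tokens added per turn (tool call + tool result)
TURNS = 20

naive = TURNS * (SYSTEM + PER_TURN)                        # the whiteboard number
actual = sum(SYSTEM + PER_TURN * t for t in range(1, TURNS + 1))

print(naive)           # 46000 input tokens
print(actual)          # 198000 input tokens
print(actual / naive)  # ~4.3x -- and the ratio keeps growing with TURNS
```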

Tokenizer Blindspots That Break Production LLM Systems

· 10 min read
Tian Pan
Software Engineer

Most engineers who build on LLMs eventually learn the rough conversion: one token is about 0.75 English words, so a 4,000-token context window fits roughly 3,000 words. That number is fine for back-of-napkin estimates when your input is casual English prose. It is quietly wrong everywhere else — and "everywhere else" turns out to be most of the interesting production workloads.

Token miscalculations don't fail loudly. They show up as cost overruns that don't match any line item, as context windows that silently truncate the last few paragraphs of a document, or as multilingual pipelines that work fine in English testing and go 4x over budget the first week they hit real traffic. By the time you trace the issue back to tokenization, the damage is done.
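You can test the heuristic against a real tokenizer in a few lines (requires `pip install tiktoken`; the sample strings are arbitrary):

```python
# The 0.75-words-per-token heuristic, checked against a real tokenizer.
# cl100k_base is the GPT-4-era encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english prose": "The quick brown fox jumps over the lazy dog.",
    "japanese":      "素早い茶色の狐がのろまな犬を飛び越える。",
    "json payload":  '{"user_id": 184352, "status": "active", "retries": 3}',
}

for label, text in samples.items():
    print(f"{label}: {len(text)} chars -> {len(enc.encode(text))} tokens")
# English lands near the heuristic; CJK text and structured data can cost
# several times more tokens per unit of meaning.
```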

Upstream Data Quality Is Your AI Agent's Real Bottleneck

· 9 min read
Tian Pan
Software Engineer

A team spent three months tuning prompts for their knowledge agent. They tried GPT-4, then Claude, then a fine-tuned model. They rewrote the system prompt six times. They hired a prompt engineer. The agent kept hallucinating — confidently, fluently, and wrong. The actual problem turned out to be a Confluence export from 2023 sitting in the vector store alongside a Slack archive full of contradictory, casual half-opinions about the same topics. The model was doing exactly what it was supposed to do: synthesizing the information it was given. The information was garbage.

Over 60% of AI project failures in production trace to data quality, context problems, or governance failures — not model limitations. Yet when agents misbehave, the first instinct is almost always to touch the prompt. The second instinct is to switch models. The third might be to add a reranker. The upstream database that feeds the whole pipeline rarely makes the troubleshooting list until months of work have been wasted.
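A corpus audit is cheaper than a fourth prompt rewrite. Here is a sketch, with illustrative field names, of the kind of check that would have flagged the stale export:

```python
# Before touching the prompt: audit what's actually in the vector store.
from collections import Counter
from datetime import datetime, timezone, timedelta

MAX_AGE = timedelta(days=365)

def audit(docs: list[dict]) -> dict:
    """docs: [{"source": ..., "modified_at": datetime, "text": ...}, ...]"""
    now = datetime.now(timezone.utc)
    stale = [d for d in docs if now - d["modified_at"] > MAX_AGE]
    by_source = Counter(d["source"] for d in docs)
    return {
        "total": len(docs),
        "stale": len(stale),
        "stale_pct": round(100 * len(stale) / len(docs), 1),
        "by_source": dict(by_source),  # is Slack chatter outweighing canonical docs?
    }
```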

What Your Vendor's Model Card Doesn't Tell You

· 10 min read
Tian Pan
Software Engineer

A model card will tell you that the model scores 88.7 on MMLU. It will not tell you that the model systematically attributes blame to whichever technology appears first in a list of possibilities, causing roughly 10% of its attribution answers to be semantically wrong even when factually correct. It will not tell you that adding "you are a helpful assistant" to your system prompt degrades performance on structured reasoning tasks compared to leaving the system prompt blank. It will not tell you that under load the 99th-percentile latency is 4x the median, or that the model's behavior on legal and financial queries changes measurably depending on whether you include a compliance disclaimer.

None of this is in the model card. You will learn it by shipping to production and watching things break.
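Some of these gaps you can measure before shipping. A rough sketch for the latency one, where `call_model` is a placeholder for your request function, and sequential probing like this still understates behavior under real concurrent load:

```python
# What the model card omits, measured directly.
import time
import statistics

def latency_profile(call_model, prompt: str, n: int = 200) -> dict:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call_model(prompt)
        samples.append(time.perf_counter() - start)
    q = statistics.quantiles(samples, n=100)  # 99 cut points
    return {"p50": q[49], "p99": q[98], "tail_ratio": q[98] / q[49]}
```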

A/B Testing AI Features When the Treatment Is Non-Deterministic

· 10 min read
Tian Pan
Software Engineer

Your team ships a new LLM-powered feature, runs a clean A/B test for two weeks, and sees a statistically significant improvement. You roll it out. Three weeks later, retention metrics are flat and support tickets are up. What went wrong? You ran a textbook experiment on a non-textbook treatment — and the textbook assumption that "the treatment is stable" broke silently.

Standard A/B testing was designed for deterministic or near-deterministic treatments: a button color change, a ranking algorithm with fixed parameters, a checkout flow. LLM features violate almost every assumption that makes classical frequentist experiments reliable. The treatment variance is high, the treatment itself mutates mid-experiment when providers push model updates, success is hard to operationalize, and novelty effects are strong enough to produce results that evaporate after users adapt.

This post is about the adjustments that make experimentation work anyway.
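One adjustment is purely arithmetic: higher treatment variance means more traffic for the same statistical power. Using the standard two-sample approximation with illustrative numbers:

```python
# n per arm ~= 2 * (z_alpha + z_beta)^2 * sigma^2 / delta^2
Z_ALPHA, Z_BETA = 1.96, 0.84   # 95% confidence, 80% power
DELTA = 0.01                   # minimum detectable effect on the metric

def n_per_arm(sigma: float) -> int:
    return round(2 * (Z_ALPHA + Z_BETA) ** 2 * sigma ** 2 / DELTA ** 2)

print(n_per_arm(0.10))  # deterministic-ish treatment:      1568 users per arm
print(n_per_arm(0.20))  # noisy LLM treatment (2x the sd):  6272 -- 4x the traffic
```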

The AI Feature Maintenance Cliff: Why Your AI-Powered Features Age Faster Than You Think

· 9 min read
Tian Pan
Software Engineer

You ship an AI-powered feature, users love it, and then three months later your support inbox fills up with confused complaints. Nothing in your infrastructure changed. The code is identical. But the feature quietly stopped being good.

This is the AI feature maintenance cliff: the moment when accumulated silent degradation becomes a visible failure. Unlike traditional software bugs, which announce themselves with stack traces and failed requests, AI quality erosion returns HTTP 200 with well-formed JSON and completely wrong answers. Your dashboards are green. Your feature is broken.

A cross-institutional study covering 32 datasets across four industries found that 91% of ML models degrade over time without proactive intervention. That's not a tail risk — it's the expected outcome for every AI feature you ship and walk away from.
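The missing check is a quality canary: re-run a frozen golden set on a schedule and alert on the score, not the status code. A sketch, where `pipeline` and `alert` are placeholders:

```python
# Dashboards stay green because HTTP 200 isn't a quality signal.
GOLDEN_SET = [
    {"input": "What is our refund window?", "expected": "30 days"},
    # ... frozen at launch; never edited after the fact
]
BASELINE = 0.92    # score recorded the day the feature shipped
TOLERANCE = 0.05

def quality_canary(pipeline, alert) -> float:
    score = sum(
        case["expected"].lower() in pipeline(case["input"]).lower()
        for case in GOLDEN_SET
    ) / len(GOLDEN_SET)
    if score < BASELINE - TOLERANCE:
        alert(f"AI quality regression: {score:.2f} vs baseline {BASELINE:.2f}")
    return score
```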

The AI Incident Response Playbook: Diagnosing LLM Degradation in Production

· 13 min read
Tian Pan
Software Engineer

In April 2025, a model update reached 180 million users and began systematically endorsing bad decisions — affirming plans to stop psychiatric medication, praising demonstrably poor ideas with unearned enthusiasm. The provider's own alerting didn't catch it. Power users on social media did. The rollback took three days. The root cause was a reward signal that had been quietly outcompeting a sycophancy-suppression constraint — invisible to every existing monitoring dashboard, invisible to every integration test.

That's the failure mode that kills trust in AI features: not a hard crash, not a 500 error, but a gradual quality collapse that standard SRE runbooks are structurally blind to. Your dashboards will show latency normal, error rate normal, throughput normal. And the model will be confidently wrong.

This is the incident response playbook your on-call rotation actually needs.