Skip to main content

507 posts tagged with "llm"

View all tags

Temporal Context Injection: Making LLMs Actually Know What Day It Is

· 11 min read
Tian Pan
Software Engineer

Your LLM-powered feature shipped. Users are asking it questions that involve time — "what's the latest policy?" "summarize what happened this week" "is this information current?" — and it answers confidently, fluently, and incorrectly.

The model doesn't know what day it is. It never did. The chat interface you're used to made that easy to forget, because those interfaces quietly inject the current date behind the scenes. Your API integration doesn't. You're shipping a system that reasons about time without knowing where it is in time — and that's a bug class that will show up in production before you think to look for it.

Text-to-SQL in Production: Why Natural Language Queries Fail at the Schema Boundary

· 9 min read
Tian Pan
Software Engineer

The demo works every time. The LLM translates "show me last quarter's top ten customers by revenue" into pristine SQL, the results pop up instantly, and everyone in the room nods. Then you deploy it against your actual warehouse — 130 tables, 1,400 columns, a decade of organic naming conventions — and the model starts confidently generating queries that return the wrong numbers. No errors. Just wrong answers.

This is the schema boundary problem, and it's why text-to-SQL has the widest gap of any AI capability between benchmark performance and production reality. A model that scores 86% on Spider 1.0 (the canonical academic benchmark) drops to around 6% accuracy on Spider 2.0, which approximates real enterprise schema complexity. Vendors demo on clean, toy schemas. You're deploying on yours.

The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think

· 10 min read
Tian Pan
Software Engineer

Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.

The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.

Tokenizer Blindspots That Break Production LLM Systems

· 10 min read
Tian Pan
Software Engineer

Most engineers who build on LLMs eventually learn the rough conversion: one token is about 0.75 English words, so a 4,000-token context window fits roughly 3,000 words. That number is fine for back-of-napkin estimates when your input is casual English prose. It is quietly wrong everywhere else — and "everywhere else" turns out to be most of the interesting production workloads.

Token miscalculations don't fail loudly. They show up as cost overruns that don't match any line item, as context windows that silently truncate the last few paragraphs of a document, or as multilingual pipelines that work fine in English testing and go 4x over budget the first week they hit real traffic. By the time you trace the issue back to tokenization, the damage is done.

Upstream Data Quality Is Your AI Agent's Real Bottleneck

· 9 min read
Tian Pan
Software Engineer

A team spent three months tuning prompts for their knowledge agent. They tried GPT-4, then Claude, then a fine-tuned model. They rewrote the system prompt six times. They hired a prompt engineer. The agent kept hallucinating — confidently, fluently, and wrong. The actual problem turned out to be a Confluence export from 2023 sitting in the vector store alongside a Slack archive full of contradictory, casual half-opinions about the same topics. The model was doing exactly what it was supposed to do: synthesizing the information it was given. The information was garbage.

Over 60% of AI project failures in production trace to data quality, context problems, or governance failures — not model limitations. Yet when agents misbehave, the first instinct is almost always to touch the prompt. The second instinct is to switch models. The third might be to add a reranker. The upstream database that feeds the whole pipeline rarely makes the troubleshooting list until months of work have been wasted.

What Your Vendor's Model Card Doesn't Tell You

· 10 min read
Tian Pan
Software Engineer

A model card will tell you that the model scores 88.7 on MMLU. It will not tell you that the model systematically attributes blame to whichever technology appears first in a list of possibilities, causing roughly 10% of its attribution answers to be semantically wrong even when factually correct. It will not tell you that adding "you are a helpful assistant" to your system prompt degrades performance on structured reasoning tasks compared to leaving the system prompt blank. It will not tell you that under load the 99th-percentile latency is 4x the median, or that the model's behavior on legal and financial queries changes measurably depending on whether you include a compliance disclaimer.

None of this is in the model card. You will learn it by shipping to production and watching things break.

A/B Testing AI Features When the Treatment Is Non-Deterministic

· 10 min read
Tian Pan
Software Engineer

Your team ships a new LLM-powered feature, runs a clean A/B test for two weeks, and sees a statistically significant improvement. You roll it out. Three weeks later, retention metrics are flat and support tickets are up. What went wrong? You ran a textbook experiment on a non-textbook treatment — and the textbook assumption that "the treatment is stable" broke silently.

Standard A/B testing was designed for deterministic or near-deterministic treatments: a button color change, a ranking algorithm with fixed parameters, a checkout flow. LLM features violate almost every assumption that makes classical frequentist experiments reliable. The treatment variance is high, the treatment itself mutates mid-experiment when providers push model updates, success is hard to operationalize, and novelty effects are strong enough to produce results that evaporate after users adapt.

This post is about the adjustments that make experimentation work anyway.

The AI Feature Maintenance Cliff: Why Your AI-Powered Features Age Faster Than You Think

· 9 min read
Tian Pan
Software Engineer

You ship an AI-powered feature, users love it, and then three months later your support inbox fills up with confused complaints. Nothing in your infrastructure changed. The code is identical. But the feature quietly stopped being good.

This is the AI feature maintenance cliff: the moment when accumulated silent degradation becomes a visible failure. Unlike traditional software bugs, which announce themselves with stack traces and failed requests, AI quality erosion returns HTTP 200 with well-formed JSON and completely wrong answers. Your dashboards are green. Your feature is broken.

A cross-institutional study covering 32 datasets across four industries found that 91% of ML models degrade over time without proactive intervention. That's not a tail risk — it's the expected outcome for every AI feature you ship and walk away from.

The AI Incident Response Playbook: Diagnosing LLM Degradation in Production

· 13 min read
Tian Pan
Software Engineer

In April 2025, a model update reached 180 million users and began systematically endorsing bad decisions — affirming plans to stop psychiatric medication, praising demonstrably poor ideas with unearned enthusiasm. The provider's own alerting didn't catch it. Power users on social media did. The rollback took three days. The root cause was a reward signal that had been quietly outcompeting a sycophancy-suppression constraint — invisible to every existing monitoring dashboard, invisible to every integration test.

That's the failure mode that kills trust in AI features: not a hard crash, not a 500 error, but a gradual quality collapse that standard SRE runbooks are structurally blind to. Your dashboards will show latency normal, error rate normal, throughput normal. And the model will be confidently wrong.

This is the incident response playbook your on-call rotation actually needs.

The AI Incident Runbook: When Your Agent Causes Real-World Harm

· 11 min read
Tian Pan
Software Engineer

Your agent just did something it shouldn't have. Maybe it sent emails to the wrong people. Maybe it executed a database write that should have been a read. Maybe it gave medical advice that sent a user to the hospital. You are now in an AI incident — and the playbook you've been using for software outages will not help you.

Traditional incident runbooks are built on a foundational assumption: given the same input, the system produces the same output. That assumption lets you reproduce the failure, bisect toward the cause, and verify the fix. None of that applies to a stochastic system operating on natural language. The same prompt through the same pipeline can produce different results across runs, providers, regions, and time. Documented AI incidents surged 56% from 2023 to 2024, yet most organizations still route these events through software incident processes designed for a fundamentally different class of problem.

This is the runbook they should have written.

The AI Output Copyright Trap: What Engineers Need to Know Before It's a Legal Problem

· 11 min read
Tian Pan
Software Engineer

When a large language model reproduces copyrighted text verbatim in response to a user prompt, who is legally responsible — the model provider, your company that built the product, or the user who typed the query? In 2026, courts are actively working through exactly this question, and the answers have consequences that land squarely on your production systems.

Most engineering teams have absorbed the basic narrative: "AI training might infringe copyright, but that's the model provider's problem." That narrative is wrong in two important ways. First, output-based liability — what the model produces at inference time — is largely distinct from training-data liability and remains an open legal question in most jurisdictions. Second, the contractual indemnification you think you have from your AI provider is probably narrower than you believe.

This post covers the practical risk surface for engineering teams: what verbatim memorization rates look like in production, how open source license contamination actually shows up in generated code, where enterprise AI agreements leave you exposed, and the engineering controls that meaningfully reduce liability without stopping AI adoption.

The Annotation Economy: Why Every Label Source Has a Hidden Tax

· 9 min read
Tian Pan
Software Engineer

Most teams pick their annotation strategy by comparing unit costs: crowd workers run about 0.08perlabel,LLMgenerationunder0.08 per label, LLM generation under 0.003, human domain experts around $1. Run the spreadsheet, pick the cheapest option that seems "good enough," and ship. This math consistently gets teams into trouble.

The actual decision is not about cost per label in isolation. Every label source carries a hidden quality tax — compounding costs in the form of garbage gradients, misleading eval curves, or months spent debugging production failures that clean labels would have caught at training time. The cheapest source is often the most expensive one when you count the downstream cost of trusting it.