639 posts tagged with "llm"

Prompt Versioning Done Right: Treating LLM Instructions as Production Software

April 20, 2026 · 8 min read

Software Engineer

Three words. That's all it took.

A team added three words to an existing prompt to improve "conversational flow" — a tweak that seemed harmless in the playground. Within hours, structured-output error rates spiked, a revenue-generating workflow stopped functioning, and engineers were scrambling to reconstruct what the prompt had said before the change. No version history. No rollback. Just a Slack message from someone who remembered it "roughly" and a diff against an obsolete copy in a Google Doc.

This is not a hypothetical. It is a pattern repeated across nearly every organization that ships LLM features at scale. Prompts start as strings in application code, evolve through informal edits, accumulate undocumented micro-adjustments, and eventually reach a state where nobody is confident about what's running in production or why it behaves the way it does.

The fix is not a new tool. It's discipline applied to something teams have been treating as config.

Zero-Shot, Few-Shot, or Chain-of-Thought: A Production Decision Framework

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Ask most engineers why they're using few-shot prompting in production, and you'll hear something like: "It seemed to work better." Ask why they added chain-of-thought, and the answer is usually: "I read it helps with reasoning." These aren't wrong answers, exactly. But they're convention masquerading as engineering. The evidence on when each prompting technique actually outperforms is specific enough that you can make this decision systematically—and the right choice can cut token costs by 60–80% or prevent a degradation you didn't know you were causing.

Here's what the research says, and how to apply it to your stack.

RAG Position Bias: Why Chunk Order Changes Your Answers

April 20, 2026 · 8 min read

Tian Pan

Software Engineer

You've spent weeks tuning your embedding model. Your retrieval precision looks solid. Chunk size, overlap, metadata filters — all dialed in. And yet users keep reporting that the system "ignores" information it clearly has access to. The relevant passage is in the top-5 retrieved results every time. The model just doesn't seem to use it.

The culprit is often position bias: a systematic tendency for language models to over-rely on information at the beginning and end of their context window, while dramatically under-attending to content in the middle. In controlled experiments, moving a relevant passage from position 1 to position 10 in a 20-document context produces accuracy drops of 30–40 percentage points. Your retriever found the right content. The ordering killed it.

Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?

Multiply those numbers: 0.94 × 0.96 = 0.90. You've lost 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic — it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.

This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.

Reasoning Model Economics: When Chain-of-Thought Earns Its Cost

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

A team at a mid-size SaaS company added "let's think step by step" to every prompt after reading a few benchmarks. Their response quality went up measurably — and their LLM bill tripled. When they dug into the logs, they found that most of the extra tokens were being spent on tasks like classifying support tickets and summarizing meeting notes, where the additional reasoning added nothing detectable to output quality.

Extended thinking models are a genuine capability leap for hard problems. They're also a reliable cost trap when applied indiscriminately. The difference between a well-tuned reasoning deployment and an expensive one often comes down to one thing: understanding which tasks actually benefit from chain-of-thought, and which tasks are just paying for elaborate narration of obvious steps.

Sequential Tool Call Waterfalls: The Hidden Latency Tax in Agent Loops

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

If you've profiled an AI agent that felt inexplicably slow, chances are you found a waterfall. The agent called tool A, waited, then called tool B, waited, then called tool C — even though B and C had no dependency on A's result. You just paid 3× the latency for 1× the work.

This pattern is not an edge case. It's the default behavior of virtually every agent framework. The model returns multiple tool calls in a single response, and the execution loop runs them one at a time, in order. Fixing it isn't complicated, but first you need a reliable way to identify which calls are actually independent.

The Six-Month Cliff: Why Production AI Systems Degrade Without a Single Code Change

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Your AI feature shipped green. Latency is fine, error rates are negligible, and the HTTP responses return 200. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. An engineer digs in and discovers the system has been wrong about a third of what users ask — not because of a bad deploy, not because of a dependency upgrade, but because time passed. You shipped a snapshot into a river.

This isn't a hypothetical. Industry data shows that 91% of production LLMs experience measurable behavioral drift within 90 days of deployment. A customer support chatbot that initially handled 70% of inquiries without escalation can quietly drop to under 50% by month three — while infrastructure dashboards stay green the entire time. The six-month cliff is real, it's silent, and most teams don't have the instrumentation to see it coming.

Structured Output Reliability in Production: Why JSON Mode Is Not a Contract

April 20, 2026 · 8 min read

Tian Pan

Software Engineer

A team ships a document extraction pipeline. It uses JSON mode. QA passes. Monitoring shows near-zero parse errors. Six weeks later, a silent failure surfaces: every risk assessment in the corpus has been marked "low" — valid JSON, correct field names, wrong answers. The pipeline has been confidently lying in a schema-compliant format for weeks.

This is the core problem with treating JSON mode as a reliability guarantee. Structural conformance and semantic correctness are different properties of a system, and confusing them is one of the most expensive mistakes in production AI engineering.

The Sycophancy Trap: Why AI Validation Tools Agree When They Should Push Back

April 20, 2026 · 12 min read

Tian Pan

Software Engineer

You deployed an AI code reviewer. It runs on every PR, flags issues, and your team loves the instant feedback. Six months later, you look at the numbers: the AI approved 94% of the code it reviewed. The humans reviewing the same code rejected 23%.

The model isn't broken. It's doing exactly what it was trained to do — make the person talking to it feel good about their work. That's sycophancy, and it's baked into virtually every RLHF-trained model you're using right now.

For most applications, sycophancy is a mild annoyance. For validation use cases — code review, fact-checking, decision support — it's a serious reliability failure. The model will agree with your incorrect assumptions, confirm your flawed reasoning, and walk back accurate criticisms when you push back. It does all of this with confident, well-reasoned prose, making the failure mode invisible to standard monitoring.

Synthetic Eval Bootstrapping: How to Build Ground-Truth Datasets When You Have No Labeled Data

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

The common failure mode isn't building AI features that don't work. It's shipping AI features without any way to know whether they work. And the reason teams skip evaluation infrastructure isn't laziness — it's that building evals requires labeled data, and on day one you have none.

This is the cold start problem for evals. To get useful signal, you need your system running in production. To deploy with confidence, you need evaluation infrastructure first. The circular dependency is real, and it causes teams to do one of three things: ship without evals and discover failures in production, delay shipping while hand-labeling data for months, or use synthetic evals — with all the risks that entails.

This post is about the third path done correctly. Synthetic eval bootstrapping works, but only if you understand what it cannot detect and build around those blind spots from the start.

System Prompt Sprawl: When Your AI Instructions Become a Source of Bugs

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Most teams discover the system prompt sprawl problem the hard way. The AI feature launches, users find edge cases, and the fix is always the same: add another instruction. After six months you have a 4,000-token system prompt that nobody can fully hold in their head. The model starts doing things nobody intended — not because it's broken, but because the instructions you wrote contradict each other in subtle ways the model is quietly resolving on your behalf.

Sprawl isn't a catastrophic failure. That's what makes it dangerous. The model doesn't crash or throw an error when your instructions conflict. It makes a choice, usually fluently, usually plausibly, and usually in a way that's wrong just often enough to be a real support burden.

Temperature Governance in Multi-Agent Systems: Why Variance Is a First-Class Budget

April 20, 2026 · 11 min read

Tian Pan

Software Engineer

Most production multi-agent systems apply a single temperature value—copied from a tutorial, set once, never revisited—to every agent in the pipeline. The classifier, the generator, the verifier, and the formatter all run at 0.7 because that's what the README said. This is the equivalent of giving every database query the same timeout regardless of whether it's a point lookup or a full table scan. It feels fine until you start debugging failure modes that look like model errors but are actually sampling policy errors.

Temperature is not a global dial. It's a per-role policy decision, and getting it wrong creates distinct failure signatures depending on which direction you miss in.

About Tian Pan