
426 posts tagged with "llm"


Privacy-Preserving Inference in Practice: The Spectrum Between Cloud APIs and On-Prem

· 9 min read
Tian Pan
Software Engineer

Most teams treat LLM privacy as a binary: either you send data to the cloud and accept the risk, or you run everything on-prem and accept the cost. Both framings are wrong. In practice, there is a spectrum of approaches with very different risk profiles and engineering budgets — and most teams are operating at the wrong point on that spectrum without realizing it.

Researchers recently demonstrated they could extract authentic PII from 3,912 individuals at a cost of $0.012 per record with a 48.9% success rate. That statistic tends to get dismissed as academic threat modeling until a security audit or compliance review lands on your desk. The question isn't whether to care about LLM privacy; it's which controls actually move the needle and how much each one costs to implement.

The Production Distribution Gap: Why Your Internal Testers Can't Find the Bugs Users Do

· 11 min read
Tian Pan
Software Engineer

Your AI feature passed internal testing with flying colors. Engineers loved it, product managers gave the thumbs up, and the eval suite showed 94% accuracy on the benchmarks. Then you shipped it, and within two weeks users were hitting failure modes you'd never seen — wrong answers, confused outputs, edge cases that made the model look embarrassingly bad.

This is the production distribution gap. It's not a new problem, but it's dramatically worse for AI systems than for deterministic software. Understanding why — and having a concrete plan to address it — is the difference between an AI feature that quietly erodes user trust and one that improves with use.

Prompt Cache Hit Rate: The Production Metric Your Cost Dashboard Is Missing

· 10 min read
Tian Pan
Software Engineer

The first time your team enables prompt caching, it feels like free money. Within hours, your token cost drops 40–60% and latency shrinks. Engineers celebrate and move on. Three months later, someone notices costs have quietly crept back up. The cache hit rate that started at 72% is now 18%. Nothing was deliberately broken. Nobody noticed.

This is the most common arc in production LLM deployments: caching is enabled once, never monitored, and silently degrades as the codebase evolves. Cache hit rate is the most impactful cost lever in an LLM stack, and most teams treat it as a one-time setup task rather than a production metric.
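What treating hit rate as a production metric looks like in practice is a small rolling check over per-request usage data. A minimal sketch, assuming your provider reports cached input tokens per request (the `Usage` field names here are illustrative; match them to your provider's actual response schema):

```python
from dataclasses import dataclass

@dataclass
class Usage:
    prompt_tokens: int   # total input tokens for the request
    cached_tokens: int   # input tokens served from the prompt cache

def cache_hit_rate(window: list[Usage]) -> float:
    """Fraction of input tokens served from cache over a window of requests."""
    total = sum(u.prompt_tokens for u in window)
    cached = sum(u.cached_tokens for u in window)
    return cached / total if total else 0.0

def check_hit_rate(window: list[Usage], floor: float = 0.50) -> None:
    # Alert when the rolling rate slips below an agreed floor, so a 72% -> 18%
    # decay surfaces as a page, not a quarterly billing surprise.
    rate = cache_hit_rate(window)
    if rate < floor:
        print(f"ALERT: prompt cache hit rate {rate:.0%} below floor {floor:.0%}")
```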

Prompt Versioning Done Right: Treating LLM Instructions as Production Software

· 8 min read
Tian Pan
Software Engineer

Three words. That's all it took.

A team added three words to an existing prompt to improve "conversational flow" — a tweak that seemed harmless in the playground. Within hours, structured-output error rates spiked, a revenue-generating workflow stopped functioning, and engineers were scrambling to reconstruct what the prompt had said before the change. No version history. No rollback. Just a Slack message from someone who remembered it "roughly" and a diff against an obsolete copy in a Google Doc.

This is not a hypothetical. It is a pattern repeated across nearly every organization that ships LLM features at scale. Prompts start as strings in application code, evolve through informal edits, accumulate undocumented micro-adjustments, and eventually reach a state where nobody is confident about what's running in production or why it behaves the way it does.

The fix is not a new tool. It's discipline applied to something teams have been treating as config.
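A minimal sketch of that discipline, assuming prompts live as versioned files in the repository rather than inline strings (the file layout and helper below are illustrative, not a specific tool):

```python
import hashlib
from pathlib import Path

PROMPT_DIR = Path("prompts")  # e.g. prompts/support_triage/v3.txt, reviewed like code

def load_prompt(name: str, version: str) -> tuple[str, str]:
    """Load a pinned prompt version and return (text, content_hash).

    Logging the hash next to every completion means any production output
    can be traced back to the exact prompt text that generated it -- and
    rollback is a one-line version change, not an archaeology project.
    """
    text = (PROMPT_DIR / name / f"{version}.txt").read_text()
    digest = hashlib.sha256(text.encode()).hexdigest()[:12]
    return text, digest

prompt, prompt_hash = load_prompt("support_triage", "v3")
```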

Zero-Shot, Few-Shot, or Chain-of-Thought: A Production Decision Framework

· 10 min read
Tian Pan
Software Engineer

Ask most engineers why they're using few-shot prompting in production, and you'll hear something like: "It seemed to work better." Ask why they added chain-of-thought, and the answer is usually: "I read it helps with reasoning." These aren't wrong answers, exactly. But they're convention masquerading as engineering. The evidence on when each prompting technique actually outperforms is specific enough that you can make this decision systematically — and the right choice can cut token costs by 60–80% or prevent a degradation you didn't know you were causing.

Here's what the research says, and how to apply it to your stack.
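As a preview, one way to make the choice explicit is a routing function keyed on two task properties; the categories below are illustrative, not the full framework:

```python
from enum import Enum

class Technique(Enum):
    ZERO_SHOT = "zero-shot"
    FEW_SHOT = "few-shot"
    CHAIN_OF_THOUGHT = "chain-of-thought"

def choose_technique(multistep_reasoning: bool, format_anchoring: bool) -> Technique:
    # Pay for chain-of-thought only on genuine multi-step reasoning; pay for
    # examples only when output format or style must be anchored; default to
    # zero-shot everywhere else and keep the token budget for retrieval.
    if multistep_reasoning:
        return Technique.CHAIN_OF_THOUGHT
    if format_anchoring:
        return Technique.FEW_SHOT
    return Technique.ZERO_SHOT
```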

RAG Position Bias: Why Chunk Order Changes Your Answers

· 8 min read
Tian Pan
Software Engineer

You've spent weeks tuning your embedding model. Your retrieval precision looks solid. Chunk size, overlap, metadata filters — all dialed in. And yet users keep reporting that the system "ignores" information it clearly has access to. The relevant passage is in the top-5 retrieved results every time. The model just doesn't seem to use it.

The culprit is often position bias: a systematic tendency for language models to over-rely on information at the beginning and end of their context window, while dramatically under-attending to content in the middle. In controlled experiments, moving a relevant passage from position 1 to position 10 in a 20-document context produces accuracy drops of 30–40 percentage points. Your retriever found the right content. The ordering killed it.
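One common mitigation is to re-place the retrieved chunks so the highest-scoring ones sit at the edges of the context, where attention is strongest. A minimal sketch, assuming the list arrives sorted best-first:

```python
def reorder_for_position_bias(chunks: list[str]) -> list[str]:
    """Interleave a best-first list so top results land at the start and end
    of the context and the weakest land in the middle, countering
    'lost in the middle' attention decay."""
    front, back = [], []
    for i, chunk in enumerate(chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# reorder_for_position_bias(["1st", "2nd", "3rd", "4th", "5th"])
# -> ["1st", "3rd", "5th", "4th", "2nd"]
```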

Testing the Retrieval-Generation Seam: The Integration Test Gap in RAG Systems

· 11 min read
Tian Pan
Software Engineer

Your retriever returns the right documents 94% of the time. Your LLM correctly answers questions given good context 96% of the time. Ship it. What could go wrong?

Multiply those numbers: 0.94 × 0.96 ≈ 0.90. You've lost roughly 10% of your queries before accounting for any edge cases, prompt formatting issues, token truncation, or the distractor documents your retriever surfaces alongside the correct ones. But the deeper problem isn't the arithmetic — it's that your unit tests will never catch this. The retriever passes its tests in isolation. The generator passes its tests in isolation. The thing that fails is the composition, and most teams have no tests for that.

This is the retrieval-generation seam: the interface between what your retriever hands off and what your generator can actually use. It's the most under-tested boundary in production RAG systems, and it's where most failures originate.
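A minimal sketch of a seam test, written pytest-style with the real components wired together; `retriever` and `generator` are assumed fixture names standing in for your actual stack:

```python
def test_retrieval_generation_seam(retriever, generator):
    # The question is curated: its answer is known to exist in the corpus.
    question = "What is our refund window for annual plans?"

    docs = retriever.search(question, k=5)  # real index, not a mock
    assert any("refund" in d.text.lower() for d in docs), "retrieval miss"

    # Same docs, same prompt template as production -- the assertion targets
    # the composition: right documents AND an answer the generator can use.
    answer = generator.answer(question, context=docs)
    assert "30 days" in answer, f"seam failure: {answer!r}"
```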

Reasoning Model Economics: When Chain-of-Thought Earns Its Cost

· 9 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company added "let's think step by step" to every prompt after reading a few benchmarks. Their response quality went up measurably — and their LLM bill tripled. When they dug into the logs, they found that most of the extra tokens were being spent on tasks like classifying support tickets and summarizing meeting notes, where the additional reasoning added nothing detectable to output quality.

Extended thinking models are a genuine capability leap for hard problems. They're also a reliable cost trap when applied indiscriminately. The difference between a well-tuned reasoning deployment and an expensive one often comes down to one thing: understanding which tasks actually benefit from chain-of-thought, and which tasks are just paying for elaborate narration of obvious steps.
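A hedged sketch of the cheapest fix: route by task type and reserve extended thinking for tasks where evals showed a measurable lift (the task labels and model ids below are placeholders):

```python
# Tasks where chain-of-thought produced no detectable quality lift in evals
# go to a small, fast model; genuinely hard problems escalate.
NO_COT_BENEFIT = {"classify_ticket", "summarize_notes", "extract_fields"}

def pick_model(task: str) -> str:
    if task in NO_COT_BENEFIT:
        return "small-fast-model"         # placeholder model id
    return "extended-thinking-model"      # placeholder model id
```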

Sequential Tool Call Waterfalls: The Hidden Latency Tax in Agent Loops

· 9 min read
Tian Pan
Software Engineer

If you've profiled an AI agent that felt inexplicably slow, chances are you found a waterfall. The agent called tool A, waited, then called tool B, waited, then called tool C — even though B and C had no dependency on A's result. You just paid 3× the latency for 1× the work.

This pattern is not an edge case. It's the default behavior of virtually every agent framework. The model returns multiple tool calls in a single response, and the execution loop runs them one at a time, in order. Fixing it isn't complicated, but first you need a reliable way to identify which calls are actually independent.
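A minimal sketch of the fix, assuming an async execution loop that receives a batch of tool calls from a single model response (the `ToolCall` shape is illustrative):

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable

@dataclass
class ToolCall:
    name: str
    args: dict[str, Any]

async def execute_batch(
    calls: list[ToolCall],
    tools: dict[str, Callable[..., Awaitable[Any]]],
) -> list[Any]:
    """Run one batch of tool calls concurrently instead of one at a time.

    Calls emitted together in a single model response cannot depend on each
    other's results (the model hasn't seen any of them yet), so gather()
    collapses the waterfall: latency becomes max(calls), not sum(calls).
    """
    return await asyncio.gather(*(tools[c.name](**c.args) for c in calls))
```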

The Six-Month Cliff: Why Production AI Systems Degrade Without a Single Code Change

· 9 min read
Tian Pan
Software Engineer

Your AI feature shipped green. Latency is fine, error rates are negligible, and the HTTP responses return 200. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. An engineer digs in and discovers the system has been wrong about a third of what users ask — not because of a bad deploy, not because of a dependency upgrade, but because time passed. You shipped a snapshot into a river.

This isn't a hypothetical. Industry data shows that 91% of production LLMs experience measurable behavioral drift within 90 days of deployment. A customer support chatbot that initially handled 70% of inquiries without escalation can quietly drop to under 50% by month three — while infrastructure dashboards stay green the entire time. The six-month cliff is real, it's silent, and most teams don't have the instrumentation to see it coming.
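The instrumentation that catches this is a business-level outcome tracked against a launch baseline, not an HTTP dashboard. A minimal sketch, assuming you log whether each conversation escalated (names and thresholds are illustrative):

```python
from collections import deque

class DriftMonitor:
    """Rolling resolution rate compared against a launch-week baseline."""

    def __init__(self, baseline: float, window: int = 1000, tolerance: float = 0.05):
        self.baseline = baseline                 # e.g. 0.70 at launch
        self.tolerance = tolerance
        self.escalations = deque(maxlen=window)  # 1 = escalated, 0 = resolved

    def record(self, escalated: bool) -> None:
        self.escalations.append(int(escalated))
        if len(self.escalations) < self.escalations.maxlen:
            return  # wait for a full window before alerting
        rate = 1 - sum(self.escalations) / len(self.escalations)
        if rate < self.baseline - self.tolerance:
            print(f"DRIFT: resolution rate {rate:.0%} vs baseline {self.baseline:.0%}")
```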

Structured Output Reliability in Production: Why JSON Mode Is Not a Contract

· 8 min read
Tian Pan
Software Engineer

A team ships a document extraction pipeline. It uses JSON mode. QA passes. Monitoring shows near-zero parse errors. Six weeks later, a silent failure surfaces: every risk assessment in the corpus has been marked "low" — valid JSON, correct field names, wrong answers. The pipeline has been confidently lying in a schema-compliant format for weeks.

This is the core problem with treating JSON mode as a reliability guarantee. Structural conformance and semantic correctness are different properties of a system, and confusing them is one of the most expensive mistakes in production AI engineering.
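A minimal sketch of keeping the two properties separate: a structural check per document, plus a corpus-level semantic canary that would have caught the weeks of uniform "low" answers (field names and thresholds are illustrative):

```python
import json
from collections import Counter

def validate_structure(raw: str) -> dict:
    """Structural conformance: valid JSON, expected field, allowed values."""
    doc = json.loads(raw)
    if doc.get("risk") not in {"low", "medium", "high"}:
        raise ValueError(f"schema violation: {doc!r}")
    return doc

def semantic_canary(results: list[dict], min_n: int = 100) -> None:
    """Semantic check: a large corpus where every risk is identical is
    almost certainly a pipeline failure, not a property of the documents."""
    dist = Counter(r["risk"] for r in results)
    if len(results) >= min_n and len(dist) == 1:
        raise RuntimeError(f"suspicious uniform output: {dist}")
```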

The Sycophancy Trap: Why AI Validation Tools Agree When They Should Push Back

· 12 min read
Tian Pan
Software Engineer

You deployed an AI code reviewer. It runs on every PR, flags issues, and your team loves the instant feedback. Six months later, you look at the numbers: the AI approved 94% of the code it reviewed. The humans reviewing the same code rejected 23%.

The model isn't broken. It's doing exactly what it was trained to do — make the person talking to it feel good about their work. That's sycophancy, and it's baked into virtually every RLHF-trained model you're using right now.

For most applications, sycophancy is a mild annoyance. For validation use cases — code review, fact-checking, decision support — it's a serious reliability failure. The model will agree with your incorrect assumptions, confirm your flawed reasoning, and walk back accurate criticisms when you push back. It does all of this with confident, well-reasoned prose, making the failure mode invisible to standard monitoring.