702 posts tagged with "llm"

RAG's Dirty Secret: Your Retrieval Succeeds but Your Answers Are Still Wrong

· 9 min read
Tian Pan
Software Engineer

Most teams building RAG systems think they have two failure modes: retrieval fails to find the relevant document, or the LLM hallucinates despite having it. The first is measured obsessively — recall@K, MRR, NDCG. The second is treated as the model's problem. Neither framing is complete.

There's a third failure mode that sits between them: retrieval succeeds (the relevant document ranks in the top-K), but the retrieved context doesn't actually contain enough information to answer the question correctly. The model gets confident, generates a plausible answer, and gets it wrong. Research on frontier models including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 shows this happens at rates above 50% on multi-step queries — and most production systems have no instrumentation to detect it.

The Reasoning Trace Privacy Problem: How Chain-of-Thought Leaks Sensitive Data in Production

· 9 min read
Tian Pan
Software Engineer

Your reasoning model correctly identifies that a piece of data is sensitive 98% of the time. Yet it leaks that same data in its chain-of-thought 33% of the time. That gap — between knowing something is private and actually keeping it private — is the core of the reasoning trace privacy problem, and most production teams haven't built for it.

Extended thinking has become a standard tool for accuracy-hungry applications: customer support triage, medical coding assistance, legal document review, financial analysis. These are also exactly the domains where the data in the prompt is most sensitive. Deploying reasoning models in these contexts without understanding how traces handle that data is a significant exposure.

The Reasoning Trace Privacy Problem: What Your CoT Logs Are Leaking

· 8 min read
Tian Pan
Software Engineer

Most teams building on reasoning models treat privacy as a two-surface problem: sanitize the prompt going in, sanitize the response coming out. The reasoning trace in between gets logged wholesale for observability, surfaced to downstream systems for debugging, and sometimes passed back to users who asked to "see the thinking." That middle layer is where the real exposure lives — and most production deployments are not treating it like the liability it is.

Research from early 2026 quantified what practitioners have been observing anecdotally: large reasoning models (LRMs) leak personally identifiable information in their intermediate reasoning steps more often than in their final answers. In one study testing five open-source models across medical and financial scenarios, the finding was unambiguous — intermediate reasoning reliably surfaces PII that the final response had successfully withheld. The final answer is sanitized; the trace is not.

Semantic Caching for LLMs: The Cost Tier Most Teams Skip

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM applications know about prompt caching — the prefix-reuse mechanism that API providers offer to discount repeated input tokens. Far fewer have deployed the layer above it: semantic caching, which eliminates LLM calls entirely for queries that mean the same thing but are phrased differently. The gap isn't laziness; it's a widespread misunderstanding of what "95% accuracy" means in semantic caching vendor documentation.

That 95% figure refers to match correctness on cache hits, not to how often the cache actually gets hit. Real production hit rates range from 10% for open-ended chat to 70% for structured FAQ systems — and the math that determines which side of that range you're on should happen before you write any cache code.
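A back-of-the-envelope version of that math, as a minimal sketch. It assumes every query pays a fixed lookup cost (embedding plus vector search) and only cache misses pay for the LLM call. The function names and the dollar figures are illustrative assumptions, not numbers from any vendor.

```python
def expected_cost_per_query(hit_rate: float, llm_cost: float, lookup_cost: float) -> float:
    """Average cost per query with a semantic cache in front of the LLM.

    Assumption: every query pays lookup_cost; only misses pay llm_cost.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return lookup_cost + (1.0 - hit_rate) * llm_cost


def savings_fraction(hit_rate: float, llm_cost: float, lookup_cost: float) -> float:
    """Fraction of the no-cache bill saved; negative means the cache costs money."""
    cached = expected_cost_per_query(hit_rate, llm_cost, lookup_cost)
    return (llm_cost - cached) / llm_cost


def break_even_hit_rate(llm_cost: float, lookup_cost: float) -> float:
    """Hit rate at which the cache pays for its own lookups."""
    return lookup_cost / llm_cost


if __name__ == "__main__":
    LLM_COST = 0.01       # hypothetical dollars per LLM call
    LOOKUP_COST = 0.0005  # hypothetical dollars per cache lookup
    for rate in (0.10, 0.40, 0.70):
        saved = savings_fraction(rate, LLM_COST, LOOKUP_COST)
        print(f"hit rate {rate:.0%}: saves {saved:.1%} of LLM spend")
```

Under these assumed costs the cache breaks even at a 5% hit rate. A 10% hit rate saves about 5% of LLM spend, while a 70% hit rate saves about 65%. Match correctness on hits does not enter this calculation at all, which is why it cannot stand in for hit rate.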

Structured Output Reliability in Production LLM Systems

· 10 min read
Tian Pan
Software Engineer

Your LLM pipeline hits 97% success rate in testing. Then it ships, and somewhere in the tail of real-world usage, a JSON parse failure silently corrupts downstream state, a missing field causes a null-pointer exception three steps later, or a response wrapped in markdown fences breaks your extraction logic at 2am. Structured output failures are the unsung reliability killer of production AI systems — they rarely show up in benchmarks, they compound invisibly in multi-step pipelines, and they're entirely preventable if you understand the actual problem.

The uncomfortable truth: naive JSON prompting fails 15–20% of the time in production environments. For a pipeline making a thousand LLM calls per day, that's 150–200 silent failures. And because those errors often don't surface immediately — they propagate forward as malformed data, not exceptions — they're the hardest class of bug to detect and debug.
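One way to turn those silent failures into loud ones is to parse defensively and raise at the boundary. The sketch below is an assumption about how such a guard might look, not the approach the full post describes: `parse_structured` and its `required_fields` parameter are hypothetical names, and it handles only two of the failure modes mentioned above (markdown fences and missing fields).

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to avoid a literal in the source

# Matches a response that is wrapped entirely in a fenced block, with or without "json".
FENCE_RE = re.compile(
    r"^" + FENCE + r"(?:json)?\s*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_structured(raw: str, required_fields: list[str]) -> dict:
    """Parse an LLM response as a JSON object or raise ValueError.

    Strips a surrounding markdown fence if present, then checks that every
    required field exists, so malformed output fails here instead of
    propagating downstream as bad data.
    """
    text = raw.strip()
    match = FENCE_RE.match(text)
    if match:
        text = match.group(1)

    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")

    missing = [name for name in required_fields if name not in data]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return data


if __name__ == "__main__":
    wrapped = FENCE + 'json\n{"ticket_id": 42, "priority": "high"}\n' + FENCE
    print(parse_structured(wrapped, ["ticket_id", "priority"]))
```

Raising at the parse step means a bad response becomes an exception the caller can count, retry, or alert on, rather than a malformed record three steps later.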

Text-to-SQL in Production: Why Correct SQL Is the Easy Part

· 10 min read
Tian Pan
Software Engineer

GPT-4o scores 86.6% on the Spider benchmark. Deploy it against your actual data warehouse and you might get 10%. That gap is not a rounding error; it is the entire problem. The queries behind that gap of roughly 76 percentage points execute without errors, return rows with the correct schema, and are completely wrong.

Text-to-SQL is not a syntax problem. Every serious production deployment discovers the same uncomfortable truth: the hard failures are silent ones. A query that scans a 10TB Snowflake table, returns revenue figures that are 30% too high due to a duplicated join, or quietly bypasses row-level security looks identical to a correct query from the outside. It finishes, it returns data, and nobody flags it.

This post covers the failure modes that actually bite teams in production, and the layered architecture that prevents them.

Agentic Coding in Production: What SWE-bench Scores Don't Tell You

· 11 min read
Tian Pan
Software Engineer

When a frontier model scores 80% on SWE-bench Verified, it sounds like a solved problem. Four out of five real GitHub issues, handled autonomously. Ship it to your team. Except: that same model, on SWE-bench Pro — a benchmark specifically designed to resist contamination with long-horizon tasks from proprietary codebases — scores 23%. And a rigorous controlled study of experienced developers found that using AI coding tools made them 19% slower, not faster.

These numbers aren't contradictions. They're the gap between what benchmarks measure and what production software engineering actually requires. If you're building or buying into agentic coding tools, that gap is the thing worth understanding.

CI/CD for LLM Applications: Why Deploying a Prompt Is Nothing Like Deploying Code

· 10 min read
Tian Pan
Software Engineer

Your code ships through a pipeline: feature branch → pull request → automated tests → staging → production. Every step is gated. Nothing reaches users without passing the checks you've defined. It's boring in the best way.

Now imagine you need to update a system prompt. You edit the string in your dashboard, hit save, and the change is live immediately — no tests, no staging, no diff in version control, no way to roll back except by editing it back by hand. This is how most teams operate, and it's the reason prompt changes are the primary source of unexpected production outages for LLM applications.

The challenge isn't that teams are careless. It's that the discipline of continuous delivery was built for deterministic systems, and LLMs aren't deterministic. The entire mental model needs to be rebuilt from scratch.

The Context Stuffing Antipattern: Why More Context Makes LLMs Worse

· 9 min read
Tian Pan
Software Engineer

When 1M-token context windows shipped, many teams took it as permission to stop thinking about context design. The reasoning was intuitive: if the model can see everything, just give it everything. Dump the document. Pass the full conversation history. Forward every tool output to the next agent call. Let the model sort it out.

This is the context stuffing antipattern, and it produces a characteristic failure mode: systems that work fine in early demos, then hit a reliability ceiling in production that no amount of prompt tweaking seems to fix. Accuracy degrades on questions that should be straightforward. Answers become hedged and non-committal. Agents start hallucinating joins between documents that aren't related. The model "saw" all the right information — it just couldn't find it.

Your Database Schema Is Your Agent's Mental Model

· 9 min read
Tian Pan
Software Engineer

Most teams building agents treat their database schema as a backend concern. The schema was designed by engineers, for engineers, following decades of relational database best practices: normalize aggressively, avoid redundancy, split reference tables, enforce foreign keys. This approach is correct for OLTP systems. It is often wrong for AI agents.

When an agent reads your schema to figure out how to answer a question, it is not parsing a data structure. It is constructing a mental model of your business. If your schema was built for application code that already understands the domain, the agent will be working against a map drawn for someone else. The result is hallucinated joins, incorrect aggregations, and tool call chains that should take two steps but take eight.

Feature Flags for AI: Progressive Delivery of LLM-Powered Features

· 10 min read
Tian Pan
Software Engineer

Most teams discover the hard way that rolling out a new LLM feature is nothing like rolling out a new UI button. A prompt change that looked great in offline evaluation ships to production and silently degrades quality for 30% of users — but your dashboards show HTTP 200s the whole time. By the time you notice, thousands of users have had bad experiences and you have no fast path back to the working state.

The same progressive delivery toolkit that prevents traditional software failures — feature flags, canary releases, A/B testing — applies directly to LLM-powered features. But the mechanics are different enough that copy-pasting your existing deployment playbook will get you into trouble. Non-determinism, semantic quality metrics, and the multi-layer nature of LLM changes (model, prompt, parameters, retrieval strategy) each create wrinkles that teams routinely underestimate.
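As a rough illustration of how a percentage rollout might treat those layers as one unit, the sketch below hashes each user into a stable bucket and flips the whole bundle (model, prompt, parameters, retrieval strategy) together. Every name here (`LLMVariant`, `rollout_bucket`, `choose_variant`, the model and prompt identifiers) is hypothetical and chosen for illustration, not taken from any specific feature-flag product.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMVariant:
    """All layers of an LLM change bundled so they roll out and roll back together."""
    model: str
    prompt_version: str
    temperature: float
    retrieval_strategy: str


def rollout_bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag.

    Hashing flag_name together with user_id keeps assignment sticky per user
    and independent across flags.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def choose_variant(user_id: str, flag_name: str, rollout_percent: int,
                   control: LLMVariant, candidate: LLMVariant) -> LLMVariant:
    """Serve the candidate to users whose bucket falls below rollout_percent."""
    if rollout_bucket(user_id, flag_name) < rollout_percent:
        return candidate
    return control


if __name__ == "__main__":
    control = LLMVariant("model-a", "prompt-v1", 0.2, "bm25")        # placeholder values
    candidate = LLMVariant("model-a", "prompt-v2", 0.2, "hybrid")    # placeholder values
    variant = choose_variant("user-123", "summarizer-v2", 10, control, candidate)
    print(variant.prompt_version)
```

Because a user's bucket never changes for a given flag, raising `rollout_percent` only adds users to the candidate group, and setting it to 0 is the fast path back to the working state. Sticky assignment also matters for non-deterministic outputs: quality comparisons stay between two fixed populations instead of mixing variants per request.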

Fine-Tuning Economics: The Real Cost Calculation Before You Commit

· 10 min read
Tian Pan
Software Engineer

Most engineers underestimate fine-tuning costs by a factor of three to five. The training run is the smallest part of the bill. Data curation, failed experiments, deployment infrastructure, and ongoing model maintenance are where budgets actually go. Teams that skip this math end up months into a fine-tuning project before realizing that a well-engineered prompt with few-shot examples would have solved the problem in a week.

This post walks through the complete economics — what fine-tuning actually costs across its full lifecycle, when LoRA and PEFT make the math work, and a decision framework for choosing between fine-tuning and prompt engineering based on real production numbers.