639 posts tagged with "llm"

Where Production LLM Pipelines Leak User Data: PII, Residency, and the Compliance Patterns That Hold Up

· 12 min read
Tian Pan
Software Engineer

Most teams building LLM applications treat privacy as a model problem. They worry about what the model knows — its training data, its memorization — while leaving gaping holes in the pipeline around it. The embarrassing truth is that the vast majority of data leaks in production LLM systems don't come from the model at all. They come from the RAG chunks you index without redacting, the prompt logs you write to disk verbatim, the system prompts that contain database credentials, and the retrieval step that a poisoned document can hijack to exfiltrate everything in your knowledge base.

Gartner predicted that at least 30% of generative AI projects would be abandoned after proof of concept by the end of 2025, with inadequate risk controls among the causes it cited. Most of those failures weren't the model hallucinating; they were privacy and compliance failures in systems engineers thought were under control.
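One of the leak points named above, prompt logs written to disk verbatim, can be narrowed by redacting before the write. The sketch below is a minimal illustration, not a complete PII detector: the two patterns (email addresses and US SSN-shaped numbers), the placeholder tokens, and the function name `redact` are all assumptions chosen for the example. Regexes like these will miss names, addresses, and most free-form identifiers.

```python
import re

# Illustrative patterns only (assumptions for this sketch). Real PII
# detection needs far broader coverage than two regexes.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text: str) -> str:
    """Replace matched spans with placeholder tokens before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


def log_line_for_prompt(prompt: str) -> str:
    """Build the string that would be written to the prompt log."""
    return f"prompt={redact(prompt)}"
```

The same `redact` call could sit in front of the indexing step so that RAG chunks are cleaned before they are embedded, though where to place it depends on the pipeline.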

Long-Context Models vs. RAG: When the 1M-Token Window Is the Wrong Tool

· 9 min read
Tian Pan
Software Engineer

When Gemini 1.5 Pro launched with a 1M-token context window, a wave of engineers declared RAG dead. The argument seemed airtight: why build a retrieval pipeline with chunkers, embeddings, vector databases, and re-rankers when you can just dump your entire knowledge base into the prompt and let the model figure it out?

That argument collapses under production load. Gemini 1.5 Pro achieves 99.7% recall on the "needle in a haystack" benchmark, which hides a single fact in a long document. On realistic multi-fact retrieval, average recall hovers around 60%. That 40% miss rate isn't a benchmarking artifact; it's facts your system silently fails to surface to users. And a 1M-token request runs 30–60x slower than a RAG pipeline, at roughly 1,250x the per-query cost.

Long-context models are a powerful tool. They're just not the right tool for most production retrieval workloads.

The Model Upgrade Trap: How Foundation Model Updates Silently Break Production Systems

· 9 min read
Tian Pan
Software Engineer

Your production system is running fine. Uptime is 99.9%. Latency is nominal. Zero error-rate alerts. Then a user files a ticket: "The summaries have been weirdly off lately." You pull logs. Nothing looks wrong. You check the model version — same one you deployed three months ago. What changed?

The model provider did. Silently.

This is the model upgrade trap: foundation models change beneath you without announcement, and standard observability infrastructure is completely blind to the behavioral drift. By the time users notice, the degradation has been compounding for weeks.

Multi-Tenant LLM API Infrastructure: What Breaks at Scale

· 9 min read
Tian Pan
Software Engineer

Most teams start with a single API key for their LLM provider, shared across everything. It works until it doesn't. Then one afternoon, a bulk job in the data pipeline consumes the entire rate limit and the user-facing chat feature goes silent. Or finance asks you to break down the $40k LLM bill by team, and you realize you have no way to answer that question.

A production API gateway in front of your LLM providers solves both of these problems — but it introduces a category of complexity that most teams underestimate until they're already in trouble.
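To make the shared-key failure concrete, here is a minimal sketch of one thing a gateway can do: keep a separate token bucket per tenant so a bulk job cannot drain the budget of the chat feature. The class name `TenantRateLimiter`, its parameters, and the tenant labels are hypothetical; a real gateway would also need shared state across processes, which this in-memory version does not provide.

```python
import time
from collections import defaultdict


class TenantRateLimiter:
    """In-memory token bucket per tenant (illustrative sketch only)."""

    def __init__(self, rate_per_sec: float, burst: float, clock=time.monotonic):
        self.rate = rate_per_sec      # tokens refilled per second
        self.burst = burst            # maximum tokens a bucket can hold
        self.clock = clock            # injectable so tests can control time
        self._tokens = defaultdict(lambda: float(burst))
        self._last = {}

    def allow(self, tenant: str, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self._last.get(tenant, now)
        self._last[tenant] = now
        # Refill this tenant's bucket, capped at the burst size.
        self._tokens[tenant] = min(self.burst, self._tokens[tenant] + elapsed * self.rate)
        if self._tokens[tenant] >= cost:
            self._tokens[tenant] -= cost
            return True
        return False
```

Because each tenant has its own bucket, exhausting the "batch" bucket leaves "chat" untouched. Recording `cost` per tenant at the same point is one way to get the per-team billing breakdown as well.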

Multimodal LLM Inputs in Production: Vision, Documents, and the Failure Modes Nobody Warns You About

· 9 min read
Tian Pan
Software Engineer

Adding vision to an LLM application looks deceptively simple. You swap a text model for a multimodal one, pass in an image alongside your prompt, and the demo works brilliantly. Then you push to production and discover that half your invoices get the total wrong, tables in PDFs lose their structure, and low-quality scans produce confident hallucinations. The debugging is harder than anything you faced with text-only systems, because the failures are visual and the LLM will not tell you it cannot see clearly.

This post covers what actually goes wrong when you move multimodal LLM inputs from prototype to production, and the architectural decisions that prevent those failures.

Prompt Versioning in Production: The Engineering Discipline Teams Learn the Hard Way

· 10 min read
Tian Pan
Software Engineer

You get paged at 2am. Users are reporting garbage output. You SSH in, check logs, stare at traces — everything looks structurally fine. The model is responding. Latency is normal. But something is wrong with the answers. Then the question lands in your incident channel: "Which prompt version is actually running right now?"

If you can't answer that question in under thirty seconds, you have a prompt versioning problem.

Prompts are treated like configuration in most early-stage LLM projects. A product manager edits a string in a .env file, a developer pastes an updated instruction into a hardcoded constant, someone else pastes a slightly different version into a staging Slack channel. Eventually the versions diverge, and nobody has a complete picture of what's running where. The experimentation-phase casualness that got you to launch becomes a liability the moment you have real users.
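One lightweight way to answer "which prompt version is running?" is to derive an identifier from the prompt text itself and attach it to every log record. The sketch below assumes a content hash is acceptable as a version identifier; the function names and the 12-character truncation are choices made for this example, not an established convention.

```python
import hashlib


def prompt_version(template: str) -> str:
    """Return a short, stable identifier derived from the prompt text."""
    # Normalize line endings and surrounding whitespace so cosmetic
    # differences between copies do not produce different versions.
    normalized = template.strip().replace("\r\n", "\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]


def log_fields(template: str, request_id: str) -> dict:
    """Fields to attach to each request log so incidents can be traced."""
    return {"request_id": request_id, "prompt_version": prompt_version(template)}
```

With this in place, two environments running diverged copies of a prompt will report different `prompt_version` values in their logs, which makes the divergence visible instead of silent.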

Semantic Caching for LLM Applications: What the Benchmarks Don't Tell You

· 8 min read
Tian Pan
Software Engineer

Every vendor selling an LLM gateway will show you a slide with "95% cache hit rate." What that slide won't show you is the fine print: that number refers to match accuracy when a hit is found, not how often a hit is found in the first place. Real production systems see 20–45% hit rates — and that gap between marketing and reality is where most teams get burned.

Semantic caching is a genuinely useful technique. But deploying it without understanding its failure modes is how you end up returning wrong answers to users with high confidence, wondering why your support queue doubled.
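The difference between match accuracy and hit rate is easier to see with the two metrics computed side by side. The counts below are hypothetical numbers picked to fall in the ranges mentioned above, and the class and field names are assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    lookups: int       # total requests that consulted the cache
    hits: int          # lookups where the cache returned an entry
    correct_hits: int  # hits whose cached answer was judged correct

    @property
    def hit_rate(self) -> float:
        """How often a hit is found at all."""
        return self.hits / self.lookups if self.lookups else 0.0

    @property
    def match_accuracy(self) -> float:
        """How often a returned hit is actually the right answer."""
        return self.correct_hits / self.hits if self.hits else 0.0

    @property
    def wrong_answer_rate(self) -> float:
        """Share of all requests served an incorrect cached answer."""
        return (self.hits - self.correct_hits) / self.lookups if self.lookups else 0.0


# Hypothetical traffic: 1,000 lookups, 300 hits, 285 of them correct.
stats = CacheStats(lookups=1000, hits=300, correct_hits=285)
```

In this made-up example the cache has 95% match accuracy but only a 30% hit rate, and 1.5% of all requests receive a wrong cached answer. A vendor slide quoting only the first number says nothing about the other two.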

JSON Mode Won't Save You: Structured Output Failures in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

When developers first wire up JSON mode, it feels like a problem solved. The LLM stops returning markdown fences, prose apologies, and curly-brace-adjacent gibberish. The output parses. The tests pass. Production ships.

Then, three weeks later, a background job silently fails because the model returned {"status": "complete"} when the schema expected {"status": "completed"}. A data pipeline crashes because a required field came back as null instead of being omitted. An agent tool-call loop terminates early because the model embedded a stray newline inside a string value and the downstream parser choked on it.

JSON mode guarantees syntactically valid JSON. It does not guarantee that the JSON means what you think it means, contains the fields your application expects, or maintains semantic consistency across requests. These are different problems, and they require different solutions.
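The gap between "parses" and "means what the application expects" can be sketched as a second validation step after `json.loads`. The field names, the allowed status values, and the function `validate_job_result` are hypothetical, invented for this example; a schema library would be the more usual choice in practice, but the standard library is enough to show the idea.

```python
import json

# Hypothetical application contract (assumptions for this sketch).
ALLOWED_STATUS = {"pending", "completed", "failed"}
REQUIRED_FIELDS = {"status", "job_id"}


def validate_job_result(raw: str) -> dict:
    """Parse model output, then enforce the application-level contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad syntax
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    nulls = sorted(k for k in REQUIRED_FIELDS if data[k] is None)
    if nulls:
        raise ValueError(f"required fields are null: {nulls}")
    status = data["status"]
    if not isinstance(status, str) or status not in ALLOWED_STATUS:
        raise ValueError(f"unexpected status: {status!r}")
    return data
```

Under this contract, `{"status": "complete", "job_id": "a1"}` is valid JSON and still gets rejected, which turns the silent background-job failure described above into an explicit error at the boundary.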

The Tool Selection Problem: How Agents Choose What to Call When They Have Dozens of Tools

· 10 min read
Tian Pan
Software Engineer

Most agent demos work with five tools. Production systems have fifty. The gap between those two numbers is where most agent architectures fall apart.

When you give an LLM four tools and a clear task, it usually picks the right one. When you give it fifty tools, something more interesting happens: accuracy collapses, token costs balloon, and the failure mode often looks like the model hallucinating a tool call rather than admitting it doesn't know which tool to use. Research from the Berkeley Function Calling Leaderboard found accuracy dropping from 43% to just 2% on calendar scheduling tasks when the number of tools expanded from 4 to 51 across multiple domains. That is not a graceful degradation curve.

When Thinking Models Actually Help: A Production Decision Framework for Inference-Time Compute

· 10 min read
Tian Pan
Software Engineer

In one study, researchers asked reasoning models to compare two numbers: 0.9 and 0.11. One model took 42 seconds to answer. The math takes a millisecond. The model spent nearly all of those 42 seconds thinking, and thinking badly. It re-examined its answer, doubted itself, reconsidered, and arrived at the correct conclusion it had already reached in its first three tokens.

This is the overthinking problem, and it is not a corner case. It is what happens when you apply inference-time compute indiscriminately to tasks that don't need it.

The emergence of reasoning models — o1, o3, DeepSeek R1, Claude with extended thinking — represents a genuine capability leap for hard problems. It also introduces a new class of production mistakes: deploying expensive, slow deliberation where fast, cheap generation was perfectly adequate. Getting this decision right is increasingly central to building AI systems that actually work.

The Three Memory Systems Every Production AI Agent Needs

· 10 min read
Tian Pan
Software Engineer

Most AI agents fail the same way: they work perfectly in demos and fall apart after the tenth real conversation. The agent that helped a user configure a billing integration last Tuesday has no idea who that user is today. It asks for their company name again. Then their plan tier. Then re-explains concepts the user already knows. The experience degrades from "useful assistant" to "chatbot with amnesia."

The instinct is to throw more context at the problem — stuff the conversation history into the prompt and call it solved. That works until it doesn't. At scale, full-context approaches become prohibitively expensive, and more troublingly, performance degrades as input grows. Research shows LLM accuracy drops measurably as context length increases, even within a model's advertised limits. A 1M-token context window is not a memory system.

The agents that work in production treat memory as a first-class architectural concern, not an afterthought. And the ones that get it right distinguish between three fundamentally different types of information that need to persist — each with different storage patterns, retrieval strategies, and decay characteristics.

What Nobody Tells You About Running MCP in Production

· 10 min read
Tian Pan
Software Engineer

The Model Context Protocol sells itself as a USB-C port for AI — plug any tool into any model and watch them talk. In practice, the first day feels like that. The second day you hit a scaling bug. By the third day you're reading CVEs about tool poisoning attacks you didn't know existed.

MCP is a genuinely useful standard. Introduced in late 2024 and quickly adopted across the industry, it has solved real integration friction between LLMs and external systems. But the gap between "got a demo working" and "running reliably under load with real users" is larger than most teams expect. Here's what that gap actually looks like.