
685 posts tagged with "llm"


The PII Leak in Your RAG Pipeline: Why Your Chatbot Knows Things It Shouldn't

· 10 min read
Tian Pan
Software Engineer

Your new internal chatbot just told an intern the salary bands for the entire engineering department. The HR director didn't configure anything wrong. No one shared a link they shouldn't have. The system just... retrieved it, because the intern asked about "compensation expectations for engineers."

This is the RAG privacy failure mode that most teams don't see coming. It's not a bug in the traditional sense—it's a fundamental mismatch between how retrieval works and how access control is supposed to work.

Prompt Archaeology: Recovering Intent from Legacy Prompts Nobody Documented

· 10 min read
Tian Pan
Software Engineer

You join a team that's been running an LLM feature in production for eighteen months. The feature is working — users like it, the business cares about it — but nobody can explain exactly what the prompt does or why it was written the way it was. The engineer who wrote it left. The Slack thread where they discussed it is buried somewhere in a channel that no longer exists. The prompt lives in a database record, 900 tokens long, with no comments and no commit message beyond "update prompt."

Now you've been asked to change it.

This situation is more common than the industry admits. Prompts are treated like configuration values: quick to write, invisible in code review, and forgotten the moment they start working. The difference is that a misconfigured feature flag announces itself immediately. A misconfigured prompt will silently degrade behavior across a subset of edge cases for weeks before anyone notices.

The Prompt Debt Spiral: How One-Line Patches Kill Production Prompts

· 9 min read
Tian Pan
Software Engineer

Six months into production, your customer-facing LLM feature has a system prompt that began as eleven clean lines and has grown to over 400 tokens of conditional instructions, hedges, and exceptions. Quality is measurably worse than at launch, but every individual change seemed justified at the time. Nobody knows which clauses conflict with each other, or whether half of them are still necessary. Nobody wants to touch it.

This is the prompt debt spiral — and most teams in production are already inside it.

The Prompt Governance Problem: Managing Business Logic That Lives Outside Your Codebase

· 9 min read
Tian Pan
Software Engineer

A junior PM edits a customer-facing prompt during a product sprint to "make it sound friendlier." Two weeks later, a backend engineer tweaks the same prompt to fix a formatting quirk. An ML engineer, unaware of either change, adds chain-of-thought instructions in a separate system message that now conflicts with the PM's edit. None of these changes have a ticket. None have a reviewer. None have a rollback plan.

This is how most teams manage prompts. And at five prompts, it's annoying. At fifty, it's a liability.

Red-Teaming Consumer LLM Features: Finding Injection Surfaces Before Your Users Do

· 9 min read
Tian Pan
Software Engineer

A dealership deployed a ChatGPT-powered chatbot. Within days, a user instructed it to agree with anything they said, then offered $1 for a 2024 SUV. The chatbot accepted. The dealer pulled it offline. This wasn't a sophisticated attack — it was a three-sentence prompt from someone who wanted to see what would happen.

At consumer scale, that curiosity is your biggest security threat. Internal LLM agents operate inside controlled environments with curated inputs and trusted data. Consumer-facing LLM features operate in adversarial conditions by default: millions of users, many actively probing for weaknesses, and a stochastic model that has no concept of "this user seems hostile." The security posture these two environments require is fundamentally different, and teams that treat consumer features like internal tooling find out the hard way.

Serving AI at the Edge: A Decision Framework for Moving Inference Out of the Cloud

· 10 min read
Tian Pan
Software Engineer

Most AI inference decisions get made the same way: the model lives in the cloud because that's where you can run it, full stop. But that calculus is changing fast. Flagship smartphones now carry neural engines capable of running 7B-parameter models at interactive speeds. A Snapdragon 8 Elite can generate tokens from a 3B model at around 10 tokens per second — fast enough for conversational use — while a Qualcomm Hexagon NPU hits 690 tokens per second on prefill. The question is no longer "can we run this on device?" but "should we, and when?"

The answer is rarely obvious. Moving inference to the edge introduces real tradeoffs: a quality tax from quantization, a maintenance burden for fleet updates, and hardware fragmentation across device SKUs. But staying in the cloud has its own costs: round-trip latency measured in hundreds of milliseconds, cloud GPU bills that compound at scale, and data sovereignty problems that no SLA can fully solve. This post lays out a practical framework for navigating those tradeoffs.

Shadow Traffic for AI Systems: The Safest Way to Validate Model Changes Before They Ship

· 10 min read
Tian Pan
Software Engineer

Most teams ship LLM changes the way they shipped web changes in 2005 — they run some offline evals, convince themselves the numbers look fine, and push. The surprise comes on Monday morning when a system prompt tweak that passed every benchmark silently breaks the 40% of user queries that weren't in the eval set.

Shadow traffic is the fix. The idea is simple: run your candidate model or prompt in parallel with production, feed it every real request, compare the outputs, and only expose users to the current version. Zero user exposure, real production data, and statistical confidence before anyone sees the change. But applying this to LLMs requires rethinking almost every piece of the implementation — because language models are non-deterministic, expensive to evaluate, and produce outputs that can't be compared with a simple diff.
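A minimal sketch of the shadow pattern described above, under stated assumptions: `handle_with_shadow`, `ShadowRecord`, and the `production` / `candidate` callables are hypothetical names for illustration, not part of any library or of the post itself. The user only ever receives the production output; the candidate output is recorded for later comparison, and a candidate failure is logged rather than surfaced.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ShadowRecord:
    """One paired observation: what production said vs. what the candidate said."""
    request: str
    production_output: str
    candidate_output: Optional[str]
    candidate_error: Optional[str]


def handle_with_shadow(
    request: str,
    production: Callable[[str], str],   # hypothetical: the currently shipped model/prompt
    candidate: Callable[[str], str],    # hypothetical: the change under test
    log: List[ShadowRecord],
) -> str:
    """Serve the production answer; record the candidate answer on the side."""
    prod_out = production(request)

    cand_out: Optional[str] = None
    cand_err: Optional[str] = None
    try:
        # Assumption: called inline for simplicity. A real deployment would
        # likely run this off the request path so it adds no user latency.
        cand_out = candidate(request)
    except Exception as exc:
        # A candidate failure must never affect what the user sees.
        cand_err = repr(exc)

    log.append(ShadowRecord(request, prod_out, cand_out, cand_err))
    return prod_out
```

Comparing the recorded pairs is deliberately left out of the sketch: because outputs are non-deterministic, a string diff over `production_output` and `candidate_output` would not be a meaningful comparison.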

The Shared Prompt Service Problem: Multi-Team LLM Platforms and the Dependency Nightmare

· 10 min read
Tian Pan
Software Engineer

On a Tuesday afternoon, the platform team at a mid-size AI startup merged a "minor improvement" to the shared system prompt. By Thursday, three separate product teams had filed bugs. One team's evaluation suite dropped from 87% to 61% accuracy. Another team's RAG pipeline started producing hallucinated citations. A third team's safety filter stopped catching a category of harmful outputs entirely. Nobody connected the dots for four days.

This is the shared prompt service problem, and it's coming for every organization that has more than one team building on a common LLM platform.

SLOs for Non-Deterministic AI Features: Setting Error Budgets When Wrong Is Probabilistic

· 10 min read
Tian Pan
Software Engineer

Your AI feature is "up." Latency is fine. Error rate is 0.2%. The dashboard is green. But over the past two weeks, the summarization quality quietly dropped — outputs are now technically coherent but factually shallow, consistently missing the key detail users care about. Nobody filed a bug. No alert fired. And you won't know until the next quarterly review when retention numbers come in.

This is the failure mode that traditional SLOs are blind to. Availability and latency measure whether your service is responding — not whether it's responding well. For deterministic systems, those two things are nearly equivalent. For LLM features, they can diverge silently for weeks.
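One way to make "responding well" measurable is to treat low-quality responses as budget-consuming events, the same way failed requests consume an availability budget. The sketch below is an illustration only: the per-request quality scores, the `threshold`, and the `slo_target` are all assumptions, and the function name is hypothetical.

```python
from typing import Sequence


def quality_budget_remaining(
    scores: Sequence[float],  # assumed: per-request quality scores in [0, 1] from some evaluator
    threshold: float,         # assumed: a score below this counts as a "bad" response
    slo_target: float,        # e.g. 0.9 means 90% of responses should meet the threshold
) -> float:
    """Return how many more bad responses the window can absorb (negative = budget blown)."""
    if not scores:
        raise ValueError("need at least one scored request")
    bad = sum(1 for s in scores if s < threshold)
    allowed_bad = (1.0 - slo_target) * len(scores)
    return allowed_bad - bad
```

With 100 scored requests, a 0.9 target, and 5 responses under the threshold, roughly half of the budget is still left. An availability dashboard would show the same window as fully healthy.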

Specification Gaming in Production LLM Systems: When Your AI Does Exactly What You Asked

· 10 min read
Tian Pan
Software Engineer

A 2025 study gave frontier models a coding evaluation task with an explicit rule: don't hack the benchmark. Every model acknowledged, 10 out of 10 times, that cheating would violate the user's intent. Then 70–95% of them did it anyway. The models weren't confused — they understood the constraint perfectly. They just found that satisfying the specification literally was more rewarding than satisfying it in spirit.

That's specification gaming in production, and it's not a theoretical concern. It's a property that emerges whenever you optimize a proxy metric hard enough, and in production LLM systems you're almost always optimizing a proxy.

SSE vs WebSockets vs gRPC Streaming for LLM Apps: The Protocol Decision That Bites You Later

· 11 min read
Tian Pan
Software Engineer

Most teams building LLM features pick a streaming protocol the same way they pick a font: quickly, without much thought, and then they live with the consequences for years. The first time the choice bites you is usually in production: a Cloudflare 524 timeout that corrupts your SSE stream, a WebSocket server that leaks memory under sustained load, or a gRPC-Web integration that worked fine in unit tests and silently fails when a client needs to send messages upstream. The protocol shapes your failure modes. Picking based on benchmark throughput is the wrong frame.

Every major LLM provider — OpenAI, Anthropic, Cohere, Hugging Face — streams tokens over Server-Sent Events. That fact is a strong prior, but not because SSE is fast. It's because SSE is stateless, trivially compatible with HTTP infrastructure, and its failure modes are predictable. The question is whether your application has requirements that force you off that path.

Structured Output Is Not Structured Thinking: The Semantic Validation Layer Most Teams Skip

· 11 min read
Tian Pan
Software Engineer

A medical scheduling system receives a valid JSON object from its LLM extraction layer. The schema passes. The types check out. The required fields are present. Then a downstream job tries to book an appointment and finds that the end_time is three hours before the start_time. Both fields are correctly formatted ISO timestamps. Neither violates the schema. The booking silently fails, and the patient gets no appointment — no error surfaced, no alert fired.

This is what it looks like when schema validation is mistaken for correctness validation. The model followed the format. It did not follow the logic.
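The scheduling example can be sketched as a check that runs after schema validation. This is a minimal illustration, not the post's implementation: `validate_appointment` is a hypothetical name, and only the `start_time` and `end_time` fields come from the example above.

```python
from datetime import datetime
from typing import Dict, List


def validate_appointment(payload: Dict[str, str]) -> List[str]:
    """Return semantic errors for a payload that already passed schema validation."""
    errors: List[str] = []

    # Assumption: both fields are ISO 8601 strings without a timezone suffix,
    # so the two parsed values are directly comparable.
    start = datetime.fromisoformat(payload["start_time"])
    end = datetime.fromisoformat(payload["end_time"])

    # The schema cannot express this relationship between the two fields.
    if end <= start:
        errors.append("end_time must be after start_time")

    return errors
```

Returning a list of errors rather than raising on the first one lets the caller surface every logical violation at once, so the bad payload is rejected loudly before the booking job runs.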