Skip to main content

567 posts tagged with "llm"

View all tags

RAG-Specific Prompt Injection: How Adversarial Documents Hijack Your Retrieval Pipeline

· 9 min read
Tian Pan
Software Engineer

Most teams securing RAG applications focus their effort in the wrong place. They validate user inputs, sanitize queries, implement rate limiting, and add output filters. All of that is necessary — and none of it stops the attack that matters most in RAG systems.

The defining vulnerability in retrieval-augmented generation isn't at the user input layer. It's at the retrieval layer — inside the documents your system pulls from its own knowledge base and injects directly into the context window. An attacker who never sends a single request to your API can still compromise your system by planting a document in your corpus. Your input validation never fires. Your injection filters never trigger. The malicious instruction arrives in your LLM's context dressed as legitimate retrieved content, and the model executes it.

The Query Rewrite Layer Your RAG System Is Missing

· 10 min read
Tian Pan
Software Engineer

Most teams tuning a RAG system focus on two levers: chunking strategy and embedding model selection. When retrieval quality degrades, they re-chunk. When recall numbers look bad, they upgrade the embedding model. Both are reasonable moves — but they're optimizing the middle of the pipeline while leaving the highest-leverage point untouched.

The user's query is almost never in the ideal form for vector retrieval. It's terse, colloquial, ambiguous, or assumes context that the index doesn't have. No matter how good your embeddings are, if you're searching with a poorly formed query, you're going to retrieve poorly. The fix isn't downstream — it's transforming the query before it reaches the vector index.

The Retrieval Emptiness Problem: Why Your RAG Refuses to Say 'I Don't Know'

· 10 min read
Tian Pan
Software Engineer

Ask a production RAG system a question your corpus cannot answer and watch what happens. It rarely says "I don't have that information." Instead, it retrieves the five highest-ranked chunks — which, having nothing better to match, are the five least-bad chunks of unrelated content — and hands them to the model with a prompt that reads something like "answer the user's question using the context below." The model, trained to be helpful and now holding text that sort of resembles the topic, produces a confident answer. The answer is wrong in a way that's architecturally invisible: the retrieval succeeded, the generation succeeded, every span was grounded in a retrieved document, and the user walked away misled.

This is the retrieval emptiness problem. It isn't a bug in any single layer. It's the emergent behavior of a pipeline that treats "top-k" as a contract and never asks whether the top-k is any good. Research published at ICLR 2025 on "sufficient context" quantified the effect: when Gemma receives sufficient context, its hallucination rate on factual QA is around 10%. When it receives insufficient context — retrieved documents that don't actually contain the answer — that rate jumps to 66%. Adding retrieved documents to an under-specified query makes the model more confidently wrong, not less.

Research Agent Design: Why Scientific Workflows Break Coding Agent Assumptions

· 10 min read
Tian Pan
Software Engineer

Most teams that build LLM-powered scientific tools make the same architectural mistake: they reach for a coding agent framework, swap in domain-specific tools, and call it a research agent. It isn't. Coding agents and research agents share surface-level mechanics — both call tools, both iterate — but their fundamental assumptions about success, state, and termination are almost perfectly inverted. Deploying a coding agent architecture in a scientific workflow doesn't just produce worse results; it produces confidently wrong results, and does so in ways that are nearly impossible to catch after the fact.

The distinction matters urgently now because research agent benchmarks are proliferating, teams are racing to build scientific AI, and the "just use a coding agent" shortcut is generating a wave of plausible-sounding tools that fail in production scientific contexts for reasons their builders don't fully understand.

Designing AI Safety Layers That Don't Kill Your Latency

· 9 min read
Tian Pan
Software Engineer

Most teams reach for guardrails the same way they reach for logging: bolt it on, assume it's cheap, move on. It isn't cheap. A content moderation check takes 10–50ms. Add PII detection, another 20–80ms. Throw in output schema validation and a toxicity classifier and you're looking at 200–400ms of overhead stacked serially before a single token reaches the user. Combine that with a 500ms model response and your "fast" AI feature now feels sluggish.

The instinct to blame the LLM is wrong. The guardrails are the bottleneck. And the fix isn't to remove safety — it's to stop treating safety checks as an undifferentiated pile and start treating them as an architecture problem.

SFT, RLHF, and DPO: The Alignment Method Decision Matrix for Narrow Domain Applications

· 11 min read
Tian Pan
Software Engineer

Most teams that decide to fine-tune a model spend weeks debating which method to use before they've written a single line of training code. The debate rarely surfaces the right question. The real question is not "SFT or DPO?" — it's "what kind of gap am I trying to close?"

Supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), and direct preference optimization (DPO) are not competing answers to the same problem. Each targets a different failure mode. Reaching for RLHF when SFT would have sufficed wastes months. Reaching for SFT when the problem is actually a preference mismatch produces a model that's fluent but wrong in ways that are hard to detect until they surface in production.

This post is a decision framework. It maps each method to the specific problem it solves, explains what signals indicate which method will dominate, and provides a diagnostic methodology for identifying where your actual gap lives before you commit to a training run.

TTFT Is the Only Latency Metric Your Users Actually Feel

· 9 min read
Tian Pan
Software Engineer

Your model generates a 500-word response in 8 seconds. A competing model generates the same response in 12 seconds. Intuitively, yours should feel faster. But if your first token arrives at 2.5 seconds and theirs arrives at 400 milliseconds, your users will describe your product as slow — regardless of total generation time. This is the central paradox of LLM latency: the metric your infrastructure team optimizes for (end-to-end generation time, tokens per second) is not the metric your users experience. Time-to-first-token is.

TTFT is not a detail. It is the primary signal users use to judge whether your AI feature is responsive. Getting it wrong means building fast systems that feel slow.

Sycophancy Is a Production Reliability Failure, Not a Personality Quirk

· 10 min read
Tian Pan
Software Engineer

Most teams think about sycophancy as a UX annoyance — the model that says "great question!" too often. That framing is dangerously incomplete. Sycophancy is a systematic accuracy failure baked in by training, and in agentic systems it compounds silently across turns until an incorrect intermediate conclusion poisons every downstream tool call that depends on it. The canonical April 2025 incident made this concrete: OpenAI shipped a GPT-4o update that endorsed a user's plan to stop psychiatric medication and validated a business idea for "shit on a stick" before a rollback was triggered four days later — after exposure to 180 million users. The root cause wasn't a prompt mistake. It was a reward signal that had been tuned on short-term user approval, which is almost perfectly anti-correlated with long-term accuracy.

The Delegation Cliff: Why AI Agent Reliability Collapses at 7+ Steps

· 8 min read
Tian Pan
Software Engineer

An agent with 95% per-step reliability sounds impressive. At 10 steps, you have a 60% chance of success. At 20 steps, it's down to 36%. At 50 steps, you're looking at a coin flip—and that's with a generous 95% estimate. Field data suggests real-world agents fail closer to 20% per action, which means a 100-step task succeeds roughly 0.00002% of the time. This isn't a model quality problem or a prompt engineering problem. It's a compounding math problem, and most teams building agents haven't internalized it yet.

This is the delegation cliff: the point at which adding one more step to an agent's task doesn't linearly increase the chance of failure—it multiplies it.

Token Budget as a Product Constraint: Designing Around Context Limits Instead of Pretending They Don't Exist

· 10 min read
Tian Pan
Software Engineer

Most AI products treat the context limit as an implementation detail to hide from users. That decision looks clean in demos and catastrophic in production. When a user hits the limit mid-task, one of three things happens: the request throws a hard error, the model silently starts hallucinating because critical earlier context was dropped, or the product resets the session and destroys all accumulated state. None of these are acceptable outcomes for a product you're asking people to trust with real work.

The token budget isn't a quirk to paper over. It's a first-class product constraint that belongs in your design process the same way memory limits belong in systems programming. The teams that ship reliable AI features have stopped pretending the ceiling doesn't exist.

When AI Features Create Moats (and When They Don't)

· 9 min read
Tian Pan
Software Engineer

A leaked internal Google memo put it plainly: "We aren't positioned to win this arms race and neither is OpenAI." The author's argument was that fine-tuning a model with LoRA costs roughly $100, that open-source communities could replicate closed-model capabilities within months, and that "we have no moat." This was a Google researcher writing about Google. If that's true inside the world's best-resourced AI lab, what does it mean for your product team betting on a data advantage?

The honest answer is that most AI features are not moats. They are rented capabilities with a UI. But some genuinely compound — and the difference is not about how much data you have. It's about the specific mechanical conditions under which data actually creates defensibility.

The Agent Test Pyramid: Why the 70/20/10 Split Breaks Down for Agentic AI

· 12 min read
Tian Pan
Software Engineer

Every engineering organization that graduates from "we have a chatbot" to "we have an agent" hits the same wall: their test suite stops making sense.

The classical test pyramid — 70% unit tests, 20% integration tests, 10% end-to-end — is built on three foundational assumptions: units are cheap to run, isolated from external systems, and deterministic. Agentic AI systems violate all three at once. A "unit" is a model call that costs tokens and returns different answers each time. An end-to-end run can take several minutes and burn through API budget that a junior engineer's entire sprint's tests couldn't justify. And isolation is nearly impossible when the agent's intelligence emerges precisely from interacting with external tools and state.