
578 posts tagged with "insider"


Your Prompt Is Competing With What the Model Already Knows

· 11 min read
Tian Pan
Software Engineer

The frontier model you just wired up has opinions about your competitors. It has a default answer to the hard question your product was built to disagree with. It has a "best practice" for your domain that came from whatever happened to dominate the training corpus, and a quiet preference for the conventional take on every controversial call your team agonized over in the design doc. None of that is in your system prompt. You did not write any of it. And on the queries where your differentiation actually lives, the model will reach for those defaults before it reaches for what you told it.

Most teams ship as if the model is a configurable blank slate. Write the persona, list the rules, paste the brand voice guidelines, run a few QA prompts that produce the right shape of answer, and call it done. The prompts that get reviewed are the prompts that hit easy queries — the ones where the model's prior happens to align with what you wanted anyway. The interesting queries, the ones where your product would lose badly if it produced the generic answer, almost never make it into the prompt-iteration loop. Those are the queries where the prior wins silently.
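A cheap way to surface those queries is to probe them directly. The sketch below is illustrative only: the client, model names, prompt path, and query list are placeholders, not anything from the post. It runs queries where your answer and the generic answer should diverge, both with and without your system prompt, and flags the cases where the two outputs converge anyway.

```python
# Illustrative probe: does the system prompt actually move the answer on the
# queries where your differentiation lives, or does the prior win silently?
import difflib
from openai import OpenAI

client = OpenAI()

# Queries chosen because the generic answer and your product's answer should diverge.
PRIOR_CONFLICTING_QUERIES = [
    "What's the best way to structure a data retention policy?",
    "Should we denormalize the analytics schema for this workload?",
]

SYSTEM_PROMPT = open("prompts/support_agent.txt").read()  # placeholder path

def answer(query: str, system: str | None) -> str:
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": query})
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content

for q in PRIOR_CONFLICTING_QUERIES:
    bare = answer(q, system=None)              # the model's prior, no prompt at all
    prompted = answer(q, system=SYSTEM_PROMPT)
    overlap = difflib.SequenceMatcher(None, bare, prompted).ratio()
    # High overlap on a prior-conflicting query means the prompt lost silently.
    verdict = "PRIOR LIKELY WON" if overlap > 0.8 else "diverged"
    print(f"{q[:60]}  overlap={overlap:.2f}  {verdict}")
```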

Your RAG Chunker Is a Database Schema Nobody Code-Reviewed

· 11 min read
Tian Pan
Software Engineer

The first time a retrieval quality regression lands in your on-call channel, the debugging path almost always leads somewhere surprising. Not the embedding model. Not the reranker. Not the prompt. The culprit is a one-line change to the chunker — a tokenizer swap, a boundary rule tweak, a stride adjustment — that someone merged into a preprocessing notebook three sprints ago. The fix touched zero lines of production code. It rebuilt the index overnight. And now accuracy is down four points across every tenant.

The chunker is a database schema. Every field you extract, every boundary you draw, every stride you pick defines the shape of the rows that land in your vector index. Change any of them and you have altered the schema of an index that other parts of your system — retrieval logic, reranker features, evaluation harnesses, downstream prompts — depend on as if it were stable. But because the chunker usually lives in a notebook or a small Python module that nobody labels as "infrastructure," these changes ship with the rigor of a config tweak and the blast radius of an ALTER TABLE.
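One way to make that schema explicit, sketched below with illustrative field names, is to hash the chunker configuration, stamp the index with the hash, and refuse to serve queries against an index built under a different hash.

```python
# Sketch: treat the chunker config as a schema version. Any change to any field
# produces a new hash, and a mismatch fails loudly instead of silently degrading.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChunkerConfig:
    tokenizer: str = "cl100k_base"
    max_tokens: int = 512
    stride: int = 128
    split_on: str = "heading"  # boundary rule

    def schema_hash(self) -> str:
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

def assert_compatible(index_metadata: dict, cfg: ChunkerConfig) -> None:
    built_with = index_metadata.get("chunker_schema")
    if built_with != cfg.schema_hash():
        raise RuntimeError(
            f"Index was built with chunker schema {built_with}, "
            f"query path expects {cfg.schema_hash()}. Rebuild the index or pin the config."
        )
```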

Why Your RAG Citations Are Lying: Post-Hoc Rationalization in Source Attribution

· 10 min read
Tian Pan
Software Engineer

Show a user an AI answer with a link at the end of each sentence, and the needle on their trust meter swings halfway across the dial before they have read a single cited passage. That is the whole marketing pitch of enterprise RAG: "grounded," "sourced," "verifiable." It is also the most-shipped, least-tested claim in AI engineering. Recent benchmarks find that between 50% and 90% of LLM responses are not fully supported — and sometimes contradicted — by the sources they cite. On adversarial evaluation sets, up to 57% of citations from state-of-the-art models are unfaithful: the model never actually used the document it is pointing at. The citation was attached after the fact, to rationalize an answer the model had already decided to give.

This is not a retrieval bug. You can have perfect retrieval and still get lying citations, because the failure is architectural. The generator writes the prose first and stitches the links on afterward. The links look like evidence. They are decoration.
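A minimal post-hoc check, assuming an LLM-as-judge verifier (the judge prompt and model name below are illustrative), is to score each generated sentence against the passage it cites and track the unfaithful-citation rate as its own metric rather than trusting the link.

```python
# Sketch: verify each citation after generation instead of assuming the cited
# passage was actually used to produce the claim.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Does the SOURCE passage fully support the CLAIM?
Answer with exactly one word: SUPPORTED, PARTIAL, or UNSUPPORTED.

CLAIM: {claim}
SOURCE: {source}"""

def citation_verdict(claim: str, cited_passage: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(claim=claim, source=cited_passage)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def unfaithful_citation_rate(pairs: list[tuple[str, str]]) -> float:
    # pairs = [(generated_sentence, text_of_the_passage_it_cites), ...]
    verdicts = [citation_verdict(claim, passage) for claim, passage in pairs]
    return sum(v != "SUPPORTED" for v in verdicts) / max(len(verdicts), 1)
```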

The Reasoning-Model Tax at Tool Boundaries

· 10 min read
Tian Pan
Software Engineer

Extended thinking wins benchmarks on novel reasoning. At a tool boundary — the moment your agent has to pick which function to call, when to call it, and what arguments to pass — that same thinking budget often makes things worse. The model weighs three equivalent tools that a fast model would have disambiguated in one token. It manufactures plausible-sounding ambiguity where none existed. It burns a thousand reasoning tokens to second-guess the obvious search call, then calls search anyway. You paid the reasoning tax on a decision that didn't need reasoning.

This is the quiet cost center of agentic systems in 2026: not the reasoning model itself, which is priced fairly for what it does well, but the reasoning model deployed at the wrong step of the loop. The anti-pattern hides in plain sight because the top-of-loop task looks hard ("answer the user's question"), so teams wrap the entire loop in high-effort thinking mode and never notice that 80% of the thinking budget is being spent deliberating on tool-choice micro-decisions the model already got right on its first instinct.
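One mitigation, sketched here with placeholder model names, is to route the tool-selection step to a fast model and reserve the extended-thinking budget for the synthesis step that actually needs it.

```python
# Sketch: split the loop so routine tool choice never pays the reasoning tax.
from openai import OpenAI

client = OpenAI()

FAST_MODEL = "gpt-4o-mini"   # picks tools and fills arguments
REASONING_MODEL = "o3"       # reserved for the hard synthesis step

def pick_tool_call(user_msg: str, tools: list[dict]) -> dict | None:
    # Tool choice is usually a near-one-token decision; a fast model is enough.
    resp = client.chat.completions.create(
        model=FAST_MODEL,
        messages=[{"role": "user", "content": user_msg}],
        tools=tools,
        tool_choice="auto",
    )
    calls = resp.choices[0].message.tool_calls
    return calls[0] if calls else None

def synthesize_answer(user_msg: str, tool_results: str) -> str:
    # Only the final, genuinely novel reasoning step gets the thinking budget.
    resp = client.chat.completions.create(
        model=REASONING_MODEL,
        messages=[{"role": "user", "content": f"{user_msg}\n\nTool results:\n{tool_results}"}],
    )
    return resp.choices[0].message.content
```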

Retry Amplification: How a 2% Tool Error Rate Becomes a 20% Agent Failure

· 13 min read
Tian Pan
Software Engineer

The spreadsheet on the on-call doc said the search tool had a 2% error rate. The incident review said the agent platform had a 20% failure rate during the three-hour window. Nobody disagreed with either number. The search team was not at fault. The platform team did not ship a bug. The gap between the two numbers is the whole story, and it is a story about arithmetic, not engineering incompetence.

Retry logic is one of the most borrowed and least adapted patterns in agent systems. Teams copy tenacity decorators from their REST client, stack them at the SDK, the gateway, and the agent loop, and ship. Each layer is individually reasonable. The composition is a siege weapon pointed at the flakiest dependency in the fleet, and it fires hardest at the exact moment that dependency needs the load to drop.

This post is about how that math works, why agent loops amplify it harder than request-response systems, and the retry discipline that keeps transient blips from becoming correlated outages with your own logo on them.
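The core arithmetic fits in a few lines. The numbers below are illustrative, assuming ten sequential tool calls per agent run and three retry attempts copied at each of three layers.

```python
# Sketch of the arithmetic behind "2% tool errors, 20% agent failures".
per_call_error = 0.02   # the search team's honest 2%
calls_per_run = 10      # sequential tool calls in a typical agent loop

# One bad call sinks the whole run, so per-run failure compounds per call.
run_failure = 1 - (1 - per_call_error) ** calls_per_run
print(f"run failure rate: {run_failure:.1%}")  # ~18.3%, the incident review's "20%"

# Retry amplification: 3 attempts at the SDK, 3 at the gateway, 3 at the agent loop.
# In the worst case every layer exhausts its budget, and the multipliers compose.
attempts_per_logical_call = 3 * 3 * 3
print(f"worst-case attempts per logical call: {attempts_per_logical_call}x")  # 27x, aimed at the flakiest dependency
```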

The Right-Edge Accuracy Drop: Why the Last 20% of Your Context Window Is a Trap

· 11 min read
Tian Pan
Software Engineer

A 200K-token context window is not a 200K-token context window. Fill it to the brim and the model you just paid for quietly becomes a worse version of itself — not at the middle, where "lost in the middle" would predict, but at the right edge, exactly where recency bias was supposed to save you. The label on the box sold you headroom; the silicon sells you a cliff.

This is a different failure mode from the one most teams have internalized. "Lost in the middle" trained a generation of prompt engineers to stuff the critical instruction at the top and the critical question at the bottom, confident that primacy and recency would carry the signal through. That heuristic silently breaks when utilization approaches the claimed window. The drop-off is not gradual, not linear, and not symmetric with how the model behaves at half-fill. Past a utilization threshold that varies by model, you are operating in a different regime, and the prompt shape that worked at 30K fails at 180K.

The economic temptation makes it worse. If you just paid for a million-token window, the pressure to use it is enormous — dump the entire repo, feed it every support ticket, hand it the quarterly filings and let it figure out what matters. That is how you get a confidently wrong answer that looks well-reasoned on the surface and disintegrates on audit.
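A blunt guardrail, assuming an effective-utilization cap you have measured for your own model (the 0.6 below is a placeholder, not a published number), is to budget context against the effective window rather than the advertised one.

```python
# Sketch: budget against an effective window, measured empirically per model.
import tiktoken

ADVERTISED_WINDOW = 200_000
EFFECTIVE_UTILIZATION = 0.6  # placeholder; find the real threshold with retrieval probes
EFFECTIVE_WINDOW = int(ADVERTISED_WINDOW * EFFECTIVE_UTILIZATION)

enc = tiktoken.get_encoding("cl100k_base")

def fit_context(system: str, question: str, documents: list[str]) -> list[str]:
    """Keep the highest-priority documents while staying under the effective window."""
    budget = EFFECTIVE_WINDOW - len(enc.encode(system)) - len(enc.encode(question))
    kept: list[str] = []
    for doc in documents:  # documents assumed pre-sorted by relevance
        cost = len(enc.encode(doc))
        if cost > budget:
            break
        kept.append(doc)
        budget -= cost
    return kept
```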

The Rubber-Stamp Collapse: Why AI-Authored PRs Are Hollowing Out Code Review

· 10 min read
Tian Pan
Software Engineer

A senior engineer approves a 400-line PR in four minutes. The diff is clean. Names are sensible. Tests pass. Two weeks later the on-call engineer is paging through a query that returns the right shape of rows but from the wrong column — user.updated_at where user.created_at was meant — and the cohort analysis dashboard has been quietly lying to the CFO for nine days. The reviewer was competent. The code was well-structured. The bug was invisible in the diff because it wasn't a syntactic smell. It was a semantic one, and the reviewer had nothing to anchor against because no one had written down what the change was supposed to do.

This is the failure mode that shows up once the majority of diffs in your repo start life as model output. Reviewers stop asking "is this correct?" and start asking "does this look like code?" The answer is almost always yes. AI-authored code is grammatically fluent in a way that bypasses the review heuristics engineers spent a decade sharpening on human-written slop.

Semantic Cache Is a Safety Problem, Not a Perf Win

· 12 min read
Tian Pan
Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative — there's a "70% bill reduction" deck floating around every AI engineering Slack — and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that "I can't log into my account, my email is jane@example.com" and "I can't log into my account, my email is bob@example.com" both vectorized within threshold of "I can't log into my account" and the cache cheerfully served Bob the response originally generated for Jane, including the password reset link her account had asked for.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."
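A sketch of the by-construction version, with illustrative threshold and storage choices: partition the cache by tenant (or by user, for personalized queries) so a similarity hit can never cross the boundary, and log every hit so the audit question is answerable later.

```python
# Sketch: tenant-scoped semantic cache. Cross-tenant hits are impossible by
# construction, and every hit leaves an audit record.
import numpy as np

SIMILARITY_THRESHOLD = 0.97  # tune per workload; tighter is safer

class TenantScopedSemanticCache:
    def __init__(self) -> None:
        # One store per tenant: (embedding, cached_response, entry_id) tuples.
        self._stores: dict[str, list[tuple[np.ndarray, str, str]]] = {}

    def get(self, tenant_id: str, query_vec: np.ndarray) -> str | None:
        for vec, response, entry_id in self._stores.get(tenant_id, []):
            sim = float(vec @ query_vec / (np.linalg.norm(vec) * np.linalg.norm(query_vec)))
            if sim >= SIMILARITY_THRESHOLD:
                # Audit every hit so "right answer" vs "someone else's answer" is checkable.
                print(f"audit: tenant={tenant_id} hit entry={entry_id} sim={sim:.3f}")
                return response
        return None

    def put(self, tenant_id: str, query_vec: np.ndarray, response: str, entry_id: str) -> None:
        self._stores.setdefault(tenant_id, []).append((query_vec, response, entry_id))
```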

Spec-First Agents: Why the Contract Has to Land Before the Prompt

· 11 min read
Tian Pan
Software Engineer

The prompt for our support agent was 2,400 tokens when I inherited it, and the engineer who wrote it had left the company. Every instruction in it was load-bearing for some production behavior, but nobody could tell me which ones. A bullet about "always restate the user's problem before answering" looked like filler until we deleted it and CSAT dropped four points in a week. The prompt was, it turned out, the specification. It was also the implementation. It was also the test suite — implicit, unwritten, living only in the head of an engineer who no longer worked there.

That is the endgame of prompt-as-spec. The prompt is both what the agent should do and how it does it, and the two become indistinguishable the moment the prompt outgrows a single author. You can't refactor it because you don't know which lines encode requirements and which encode mere hints. You can't review a change because there's no artifact the change is a diff against. You can't onboard anyone to own it because ownership means "having read the whole thing recently and remembering why each clause is there," which is a six-month investment nobody authorizes.

Spec-first flips the ordering. The contract — inputs, outputs, invariants, error cases, refusal semantics, escalation triggers — is a first-class artifact that precedes the prompt and constrains every revision. Prompt edits become diffs against a spec, not rewrites of the spec itself. The shift sounds bureaucratic until you see what it unlocks: evals that come from the spec rather than the other way around, reviews that take minutes instead of afternoons, and the eventual ability to let a new engineer own the surface without a six-month apprenticeship.
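What the contract can look like in practice, with illustrative field names and a toy support-agent example: a small, reviewable artifact that evals are generated from, rather than reverse-engineered out of the prompt.

```python
# Sketch: the contract as a first-class artifact that precedes the prompt.
from dataclasses import dataclass

@dataclass
class AgentSpec:
    name: str
    inputs: dict[str, str]          # field -> type/description
    outputs: dict[str, str]
    invariants: list[str]           # always true, each one testable
    refusal_triggers: list[str]     # when the agent must decline
    escalation_triggers: list[str]  # when it must hand off to a human

SUPPORT_AGENT_SPEC = AgentSpec(
    name="support_agent",
    inputs={"ticket_text": "str", "customer_tier": "free|pro|enterprise"},
    outputs={"reply": "str", "resolution_code": "enum"},
    invariants=[
        "reply restates the user's problem before answering",
        "reply never quotes internal ticket IDs",
    ],
    refusal_triggers=["request for another customer's data"],
    escalation_triggers=["refund over $500", "legal threat"],
)

# Evals come from the spec, not the other way around: every invariant and
# trigger becomes at least one test case that any prompt revision must pass.
def eval_cases(spec: AgentSpec) -> list[str]:
    return spec.invariants + spec.refusal_triggers + spec.escalation_triggers
```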

Token Spend Is a Security Signal Your SOC Isn't Watching

· 11 min read
Tian Pan
Software Engineer

The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.

That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.
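Closing the gap does not require new telemetry, only routing spend data somewhere security can see it. A minimal sketch, with illustrative thresholds, of a per-key rolling baseline that alerts the SOC instead of waiting for the monthly billing export:

```python
# Sketch: treat per-key token spend as a security signal with a rolling
# baseline, not a finance line item reviewed once a month.
from collections import defaultdict, deque

WINDOW = 24           # hours of history kept per API key
SPIKE_FACTOR = 5.0    # alert when an hour exceeds 5x the rolling mean

history: dict[str, deque] = defaultdict(lambda: deque(maxlen=WINDOW))

def alert_soc(api_key_id: str, observed: int, baseline_mean: float) -> None:
    # Route to the SOC alongside auth and DLP alerts, not to a billing dashboard.
    print(f"SECURITY: key {api_key_id} spent {observed} tokens/hr "
          f"(baseline ~{baseline_mean:.0f}); possible LLMjacking or exfiltration")

def record_hourly_tokens(api_key_id: str, tokens_this_hour: int) -> None:
    baseline = history[api_key_id]
    if len(baseline) >= 6:  # need a few hours before judging
        mean = sum(baseline) / len(baseline)
        if tokens_this_hour > SPIKE_FACTOR * max(mean, 1):
            alert_soc(api_key_id, tokens_this_hour, mean)
    baseline.append(tokens_this_hour)
```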

Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running

· 9 min read
Tian Pan
Software Engineer

Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.

A hallucinated tool call is not only when the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws ToolNotFoundError. That is the easy case. The harder case is when the fabricated call shadows into an adjacent real tool through fuzzy matching, or when the tool name is correct but the agent invents an argument your schema happily accepts because you marked it optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.
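A probe suite can start as small as the sketch below, where the catalog format and example tools are illustrative: classify every proposed call into the three variants instead of only counting dispatcher exceptions.

```python
# Sketch: classify tool calls by hallucination variant rather than only
# counting ToolNotFoundError.
import difflib

CATALOG: dict[str, set[str]] = {
    "search_users": {"query", "limit"},
    "get_user": {"user_id"},
}

def classify_tool_call(name: str, args: dict) -> str:
    if name not in CATALOG:
        shadow = difflib.get_close_matches(name, CATALOG.keys(), n=1, cutoff=0.8)
        # A near-miss that fuzzy dispatch would silently route to a real tool is
        # more dangerous than an outright ToolNotFoundError.
        return f"hallucinated_name(shadow={shadow[0]})" if shadow else "hallucinated_name"
    unknown_args = set(args) - CATALOG[name]
    if unknown_args:
        # Correct tool, invented arguments: passes a naive "call succeeded" dashboard.
        return f"hallucinated_args({sorted(unknown_args)})"
    return "ok"

# Example probes:
print(classify_tool_call("search_user", {"query": "jane"}))        # shadows search_users
print(classify_tool_call("get_user", {"user_id": 7, "purge": 1}))  # invented argument
```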

The Unmergeable Agentic Refactor: Why Multi-File Diffs Break at the Seam

· 9 min read
Tian Pan
Software Engineer

A 40-file refactor from a coding agent lands on your desk. You open the PR, scroll through the diff, and every hunk looks fine. The rename is consistent, the imports are tidy, the tests compile in isolation. You merge. Forty minutes later, CI on main goes red because two call sites in a sibling package still pass three arguments to a function that now takes four, and the type checker that would have caught it was never part of the agent's inner loop.

This is the most common failure mode in agent-authored refactors today, and it has almost nothing to do with the quality of the individual edits. Each file, reviewed on its own, looks like something a careful human would have written. The bug lives at the seams — the boundaries where edits from different files have to agree. File-level review hides seam-level correctness, and most review workflows were designed around files.
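A seam check does not have to be sophisticated. The sketch below, with illustrative commands you would swap for your repo's own type checker and test runner, gates the agent's branch on the whole-repo view before a human ever opens the diff.

```python
# Sketch: whole-repo verification as a merge gate for agent-authored PRs.
import subprocess
import sys

SEAM_CHECKS = [
    ["mypy", "."],                      # cross-file signature agreement
    ["python", "-m", "pytest", "-q"],   # integration paths across packages
]

def seam_check() -> bool:
    for cmd in SEAM_CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"seam check failed: {' '.join(cmd)}\n{result.stdout[-2000:]}")
            return False
    return True

if __name__ == "__main__":
    # Exit nonzero so CI blocks the merge when the seams disagree.
    sys.exit(0 if seam_check() else 1)
```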