Skip to main content

907 posts tagged with "insider"

View all tags

Semantic Cache Is a Safety Problem, Not a Perf Win

· 12 min read
Tian Pan
Software Engineer

A semantic cache hit is the only LLM optimization that can serve the wrong answer to the wrong user in under a millisecond. SQL caches return your row or someone else's because somebody wrote a bad join — the failure mode is a query bug. Semantic caches return another tenant's response because two embeddings landed within 0.03 cosine of each other, which is the system working exactly as designed. The cache is doing its job. The job is the problem.

Most teams ship semantic caching as a cost initiative — there's a "70% bill reduction" deck floating around every AI engineering Slack — and review the cache key the way they'd review a Redis TTL: not at all. That review goes to the perf team. The safety team never sees the design doc because nobody filed a security review for "we added a faster path." Six months later somebody's compliance audit finds that "I can't log into my account, my email is [email protected]" and "I can't log into my account, my email is [email protected]" both vectorized within threshold of "I can't log into my account" and the cache cheerfully served Bob the response originally generated for Jane, including the password reset link her account had asked for.

This post is about why semantic caches deserve the same review rigor as SQL predicates, the cache-key design that prevents cross-user leak by construction, and the audit trail you need to distinguish "cache hit served the right answer" from "cache hit served someone else's answer at sub-millisecond latency."

Spec-First Agents: Why the Contract Has to Land Before the Prompt

· 11 min read
Tian Pan
Software Engineer

The prompt for our support agent was 2,400 tokens when I inherited it, and the engineer who wrote it had left the company. Every instruction in it was load-bearing for some production behavior, but nobody could tell me which ones. A bullet about "always restate the user's problem before answering" looked like filler until we deleted it and CSAT dropped four points in a week. The prompt was, it turned out, the specification. It was also the implementation. It was also the test suite — implicit, unwritten, living only in the head of an engineer who no longer worked there.

That is the endgame of prompt-as-spec. The prompt is both what the agent should do and how it does it, and the two become indistinguishable the moment the prompt outgrows a single author. You can't refactor it because you don't know which lines encode requirements and which encode mere hints. You can't review a change because there's no artifact the change is a diff against. You can't onboard anyone to own it because ownership means "having read the whole thing recently and remembering why each clause is there," which is a six-month investment nobody authorizes.

Spec-first flips the ordering. The contract — inputs, outputs, invariants, error cases, refusal semantics, escalation triggers — is a first-class artifact that precedes the prompt and constrains every revision. Prompt edits become diffs against a spec, not rewrites of the spec itself. The shift sounds bureaucratic until you see what it unlocks: evals that come from the spec rather than the other way around, reviews that take minutes instead of afternoons, and the eventual ability to let a new engineer own the surface without a six-month apprenticeship.

Token Spend Is a Security Signal Your SOC Isn't Watching

· 11 min read
Tian Pan
Software Engineer

The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.

That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.

Tool Hallucination Rate: The Probe Suite Your Agent Team Isn't Running

· 9 min read
Tian Pan
Software Engineer

Ask an agent team what their tool-call success rate is and you will get an answer. Ask them what their tool-hallucination rate is and the room goes quiet. Most teams do not track it, and the ones who do usually only count the catastrophic version — a function name that does not exist in the catalog — while the quieter, more expensive variants travel through production unmetered.

A hallucinated tool call is not only when the model invents delete_orphaned_users(older_than="30d") and your dispatcher throws ToolNotFoundError. That is the easy case. The harder case is when the fabricated call shadows into an adjacent real tool through fuzzy matching, or when the tool name is correct but the agent invents an argument your schema happily accepts because you marked it optional. Both of those pass your "did a tool call succeed" dashboard. Neither is what the user asked for.

The Unmergeable Agentic Refactor: Why Multi-File Diffs Break at the Seam

· 9 min read
Tian Pan
Software Engineer

A 40-file refactor from a coding agent lands on your desk. You open the PR, scroll through the diff, and every hunk looks fine. The rename is consistent, the imports are tidy, the tests compile in isolation. You merge. Forty minutes later, CI on main goes red because two call sites in a sibling package still pass three arguments to a function that now takes four, and the type checker that would have caught it was never part of the agent's inner loop.

This is the most common failure mode in agent-authored refactors today, and it has almost nothing to do with the quality of the individual edits. Each file, reviewed on its own, looks like something a careful human would have written. The bug lives at the seams — the boundaries where edits from different files have to agree. File-level review hides seam-level correctness, and most review workflows were designed around files.

The Validator Trap: How Post-Hoc Guards Rot Your Prompt From the Inside

· 9 min read
Tian Pan
Software Engineer

The first time a validator catches a bad LLM output, it feels like a win. The second time, you tweak the prompt to make the failure less likely. By the twentieth time, nobody on the team can explain why three paragraphs of the prompt exist — they are scar tissue from incidents long forgotten, and the model is spending more tokens reading warnings than reasoning about the actual task.

This is the validator trap. Every post-hoc guard you add — a JSON schema check, a regex, a content classifier, a second LLM-as-judge — exerts feedback pressure on the upstream prompt. The prompt grows defensive instructions to appease the guard, the guard in turn catches a new class of failure, and you add more instructions. Each iteration looks local and sensible. In aggregate, the system gets slower, more expensive, and measurably worse at the task you originally designed it for.

Agent Fleet Concurrency: Coordinating Dozens of Agents Without Deadlock or the Thundering Herd

· 12 min read
Tian Pan
Software Engineer

Eleven agents started at the same second. Three died before the first tool call returned. That 27% fatality rate was not a model problem, a prompt problem, or a tool problem. It was a scheduling problem — the same kind of problem an operating system solves when fifty processes wake up at once and fight over a single CPU. The difference is that the OS has forty years of accumulated wisdom and the agent runtime has about two.

Anyone who has wired up more than a handful of concurrent LLM workers has seen some version of this. You kick off a scheduled job at 02:00, thirty agents spin up, they all hit the same provider within 200 ms of each other, and most of them fail with a mix of 429s, 502s, and connection resets. The survivors get half the rate budget they were promised because the provider's fair-share logic has already started throttling your API key. By 02:05 the surviving agents finish and your dashboard shows a completion rate that would embarrass a first-year CS student writing their first producer-consumer. Your on-call rotation debates whether to add retries, add a queue, or just run fewer of them.

None of those are the right answer by themselves. The right answer is that a fleet of agents is a small distributed system and needs to be designed like one.

The AI Changelog Problem: Why Your Prompt Updates Are Breaking Other Teams

· 11 min read
Tian Pan
Software Engineer

A platform team ships a one-line tweak to the system prompt of their summarization service. No code review, no migration guide, no version bump — it's "just a prompt." Two weeks later, the legal product team finds out their compliance auto-redaction has been silently letting names through. The investigation eats a sprint. The fix is trivial. The damage is the trust.

This is the AI changelog problem in miniature. Behavior is now a first-class output of your system, and behavior changes when prompts, models, retrievers, or tool schemas change — none of which show up in git diff of the consuming application. Teams that treat AI updates like backend deploys, where a Slack message in #releases is enough, end up reinventing the worst parts of the early-2010s "we'll just push and tell QA later" workflow.

Design Your Agent State Machine Before You Write a Single Prompt

· 10 min read
Tian Pan
Software Engineer

Most engineers building their first LLM agent follow the same sequence: write a system prompt, add a loop that calls the model, sprinkle in some tool-calling logic, and watch it work on a simple test case. Six weeks later, the agent is an incomprehensible tangle of nested conditionals, prompt fragments pasted inside f-strings, and retry logic scattered across three files. Adding a feature requires reading the whole thing. A production bug means staring at a thousand-token context window trying to reconstruct what the model was "thinking."

This is the spaghetti agent problem, and it's nearly universal in teams that start with a prompt rather than a design. The fix isn't a better prompting technique or a different framework. It's a discipline: design your state machine before you write a single prompt.

The AI Audit Trail Is a Product Feature, Not a Compliance Checkbox

· 8 min read
Tian Pan
Software Engineer

McKinsey's 2025 survey found that 75% of business leaders were using generative AI in some form — but nearly half had already experienced a significant negative consequence. That gap is not a model quality problem. It's a trust problem. And the fastest path to closing it is not more evals, better prompts, or a new frontier model. It's showing users exactly what the agent did.

Most engineering teams treat the audit trail as an afterthought — something you wire up for GDPR compliance or SOC 2, then lock in an internal dashboard that only ops reads. That's the wrong frame. When users can see which tool the agent called, what data it retrieved, and which reasoning branch produced the answer, three things happen: adoption goes up, support escalations go down, and model errors surface days earlier than they would from any backend alert.

AI Code Review in Practice: What Automated PR Analysis Actually Catches and Consistently Misses

· 9 min read
Tian Pan
Software Engineer

Forty-seven percent of professional developers now use AI code review tools—up from 22% two years ago. Yet in the same period, AI-coauthored PRs have accumulated 1.7 times more post-merge bugs than human-written code, and change failure rates across the industry have climbed 30%. Something is wrong with how teams are deploying these tools, and the problem isn't the tools themselves.

The core issue is that engineers adopted AI review without understanding its capability profile. These systems operate at a 50–60% effectiveness ceiling on realistic codebases, excel at a narrow class of surface-level problems, and fail silently on exactly the errors that cause production incidents. Teams that treat AI review as a general-purpose quality gate get false confidence instead of actual coverage.

AI Compliance Infrastructure for Regulated Industries: What LLM Frameworks Don't Give You

· 11 min read
Tian Pan
Software Engineer

Most teams deploying LLMs in regulated industries discover their compliance gap the hard way: the auditors show up and ask for a complete log of which documents informed which outputs on a specific date, and there is no answer to give. Not because the system wasn't logging — it was — but because text logs of LLM calls aren't the same thing as a tamper-evident audit trail, and an LLM API response body isn't the same thing as output lineage.

Finance, healthcare, and legal are not simply "stricter" versions of consumer software. They require infrastructure primitives that general-purpose LLM frameworks never designed for: immutable event chains, per-output provenance, refusal disposition records, and structured explainability hooks. None of the popular orchestration frameworks give you these out of the box. This article describes the architecture gap and how to close it without rebuilding your entire stack.