Skip to main content

861 posts tagged with "insider"

View all tags

Shadow MCP: The Tool Servers Your Security Team Has Never Heard Of Are Already Running on Your Engineers' Laptops

· 13 min read
Tian Pan
Software Engineer

Your security team has a complete inventory of every SaaS subscription on the corporate card, every OAuth app with admin consent, every device on the corporate Wi-Fi. They have zero visibility into the seven processes bound to 127.0.0.1 on your senior engineer's laptop right now — a "deploy assistant" with a long-lived staging API token, a "ticket triager" subscribed to a customer-data Slack channel, a "release notes generator" with read access to the production analytics warehouse. None of it is on a vendor list. None of it shows up in the SSO logs. All of it is running on credentials the engineer already had, doing things nobody approved them to do.

This is shadow MCP, and it is the fastest-growing unmanaged authorization surface in the enterprise. The Model Context Protocol made it trivially cheap to wire any tool into any LLM, and engineers — being engineers — wired the obvious things first. Saviynt's CISO AI Risk Report puts the number at 75% of CISOs who have already discovered unsanctioned AI tools running in their production environments. The GitHub MCP server crossed two million weekly installs in early 2026. The Postgres MCP server, which gives an LLM a SQL prompt against any database the developer can reach, is north of 800,000 weekly installs. None of those numbers represent enterprise IT decisions.

The Shared-Prompt Flag Day: When One Edit Becomes Thirty Teams' Regression

· 10 min read
Tian Pan
Software Engineer

The first edit to a shared system prompt feels like good engineering. Three teams all paste the same eighteen-line safety preamble at the top of their agents, someone notices, and an internal platform team says the obvious thing: let's centralize it. A prompts.common.safety_preamble@v1 lands in a registry. Thirty teams adopt it within a quarter because it's the path of least resistance — and because security is happy that one team owns the wording. For two quarters, this looks like a clean DRY win.

Then the security team needs a small wording change. Maybe a new compliance regulation tightens what an assistant is allowed to volunteer about a user's account. Maybe a red-team finding requires a one-sentence addition to the refusal clause. The platform team makes the edit, ships v2, and within a day the support queue fills with messages from consumer teams: our eval dropped, our format broke, our tool-call rate halved, our tone changed, our latency went up because the model started reasoning more. Each team wants the edit reverted. The security team needs it shipped. Nobody can roll forward without a re-eval, and nobody owns the re-eval. Welcome to the shared-prompt flag day.

Streaming JSON Parsers: The Gap Between Tokens and Typed Objects

· 12 min read
Tian Pan
Software Engineer

The model is emitting JSON token by token. Your UI wants to render fields the moment they materialize — a confidence score before the long answer body, the arguments of a tool call as the model fills them in. Then someone wires up JSON.parse on every chunk and the whole thing falls over, because JSON.parse is all-or-nothing. It needs a balanced document to return anything. Until the model emits the closing brace, you have nothing to show.

This is not a parser problem you can fix with a try/catch. The standard JSON parser was designed against a content-length-known HTTP response. Partial input is not a state it models — it is "input error." When you treat a token stream as if it were an HTTP body, you inherit thirty years of "the document is either complete or invalid," and your UI pays the bill.

Structured Concurrency for Parallel Tool Fanout: Who Owns Partial Failure?

· 11 min read
Tian Pan
Software Engineer

The moment your agent fans out five parallel tool calls — search across three indexes, query two databases, hit one external API — you have crossed an invisible line. You are no longer writing prompt-and-response code. You are writing a concurrent program. Most agent frameworks pretend you are not, and the bill arrives at 2 AM.

The pretense is comfortable. The planner emits a list of tool calls, the runtime fires them off, the runtime collects whatever comes back, the planner consumes the aggregate. From a thousand feet up it looks like a fan-out / fan-in pipeline, and most teams treat it that way until production teaches them otherwise. The problem is that twenty years of concurrent-programming research — partial-failure semantics, structured cancellation, backpressure, deterministic error attribution — already solved the failure modes you are about to rediscover. Your agent framework, by default, did not import any of it.

Token Amplification: The Prompt-Injection Attack That Burns Your Bill

· 10 min read
Tian Pan
Software Engineer

A user submits a $0.01 request. Your agent reads a webpage. Forty seconds later, the inference bill for that single turn is $42. The query was technically successful — the agent returned a reasonable answer. It just took three nested sub-agents, a 200K-token document fetch, and a recursive plan refinement loop to get there. None of that fanout was the user's idea. It was a sentence buried in the page the agent read.

This is token amplification: a prompt-injection class that does not exfiltrate data, does not call unauthorized tools, and does not leave a clean security signature. It just sets your bill on fire. The cloud bill is the payload, and the user's request is the carrier.

Tokenizer Churn: The Silent Breaking Change Inside Your 'Compatible' Model Upgrade

· 11 min read
Tian Pan
Software Engineer

The vendor said the upgrade was a drop-in replacement. The API contract held. The model name in your config barely changed. A week later, your context-window guard starts triggering on prompts it never tripped on before, your stop-sequence regex matches in the wrong place, and one of your few-shot examples started producing a confidently wrong answer that your eval suite happens not to cover. Nobody touched the prompt. Nobody touched the temperature. Somebody quietly retrained the tokenizer.

Tokenizer changes are the breaking change vendors don't call breaking. The API surface is byte-stable, the SDK didn't bump a major version, and the release notes mention "improved instruction following" — but the function from your input string to the integer sequence the model actually sees has been replaced. Every assumption your code made about how text becomes tokens is now subtly wrong. The cost of that invisibility is two weeks of "the model feels different" before someone re-runs a canonical prompt through count_tokens and finds the answer.

Your Tool Catalog Is a Power Law and You're Optimizing the Long Tail

· 11 min read
Tian Pan
Software Engineer

Pull a week of tool-call traces from any production agent and the shape is the same: three or four tools handle 90% of the calls, and a couple of dozen others split the remaining 10%. The catalog is a power law, but the framework treats it like a uniform list. Every tool description ships in every system prompt, every selection rubric weights tools equally, every eval samples the catalog as if a search-files call and a refund-issue call were drawn from the same distribution. They are not.

The cost of that flatness is invisible until it isn't. A team adds the eighteenth tool, the planner's accuracy on the original three drops two points, nobody can localize the regression to a specific change because everything moved at once, and the eval suite — itself uniform across the catalog — averages the slip into a number that still looks fine. Meanwhile the tokens spent describing tools the model will not call this turn now exceed the tokens spent on the user's actual prompt.

Tool-Composition Privilege Escalation: Your Security Review Cleared the Nodes, Not the Edges

· 10 min read
Tian Pan
Software Engineer

read_file is safe. send_email is safe. Your security review cleared each one against its own threat model: read-only access to a known directory, outbound mail through an authenticated relay with rate limits and recipient logging. Each passed. Both got registered. Then the agent composed them, and a single line of injected text in a customer support ticket turned the pair into an exfiltration tool that the original review had no language to describe.

The danger does not live in any node of the tool graph. It lives in the edges. Every per-tool security review you ran produced a verdict on a vertex; the actual permission surface of your agent is the set of paths through the catalog, and that set grows quadratically while your review process scales linearly. By the time your agent has fifteen registered tools, you have reviewed fifteen things and shipped roughly two hundred reachable two-step compositions, none of which any human auditioned.

Abstain or Escalate: The Two-Threshold Problem in Confidence-Gated AI

· 13 min read
Tian Pan
Software Engineer

Most production AI features ship with a single confidence threshold. Above the line, the model answers. Below it, the user gets a flat "I'm not sure." That single number is doing two completely different jobs at once, and it's why your trust metric has been sliding for two quarters even though your accuracy on answered queries looks fine.

The right design has at least two cutoffs. An abstain threshold sits low: below it, the model declines because no answer is worth more than silence. An escalate threshold sits in the middle: between the two cutoffs, the system hands the case to a human reviewer instead of dropping it on the floor. Collapse them into a single dial and you ship a product that feels equally useless when it's wrong and when it's uncertain — which is the worst possible position to occupy in a market where users have a free alternative one tab away.

This isn't a new idea. The reject-option classifier literature has been arguing for split thresholds since the 1970s, distinguishing ambiguity rejects (the input is between known classes) from distance rejects (the input is far from any training data). Production AI teams keep rediscovering the same lesson the hard way, usually about six months after their first launch, when the support queue is full of people typing "is this thing broken or what."

Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

· 11 min read
Tian Pan
Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, four hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98% — a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug — it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.

Your Agent's Outbox Is Your Next Deliverability Incident

· 11 min read
Tian Pan
Software Engineer

The first time it happens, the on-call engineer is staring at a Gmail Postmaster dashboard that has gone solid red, the support inbox is on fire because customer password resets are landing in spam, and the agent that did this is still running. It sent eighty thousand "personalized follow-ups" between 4 a.m. and 9 a.m. local time, all from the company's primary sending domain, all signed with the same DKIM key the billing system uses. By the time anyone notices, the domain reputation that took three years to build is gone, and so are the next six weeks of inbox placement on every transactional message the company depends on.

Sending email from an agent looks like a one-line tool call. send_email(to, subject, body) is the canonical demo, and every framework ships it as a starter integration. But email is not like other tools. A bad database query rolls back. A bad API call returns an error. A bad batch of email lowers the deliverability of every other email your company sends, for weeks, and there is no transaction to roll back because the messages are already in flight to recipient mailservers that are now writing your domain's reputation history.

Why AI-Generated Comments Rot Faster Than the Code They Describe

· 11 min read
Tian Pan
Software Engineer

When an agent writes a function and a comment in the same diff, the comment is not documentation. It is a paraphrase of the code at write-time, generated by the same model from the same context, and it is silently wrong the first time the code shifts. The function gets refactored, an argument changes type, an early-return gets added, the comment stays. By next quarter, the comment is encoding a specification that no longer matches the code, and the next reader trusts the comment because the comment is easier.

This is an old failure mode — humans-edit-code-comments-stay-stale — but agents accelerate it across three dimensions at once. Comment volume goes up because agents add a doc block to every function whether it needs one or not. The comments are grammatically perfect, so reviewers don't flag them as low-quality. And the comments paraphrase the code in different terms than the code actually executes, so they look like documentation but encode a second specification that drifts independently of the first.