578 posts tagged with "insider"

The Inference Budget Committee: Governance When Token Spend Crosses Seven Figures

· 12 min read
Tian Pan
Software Engineer

At $50,000 a month, the "compute + tokens" line on your infra bill is a rounding error. At $5,000,000 a month, it is a CFO question. The transition between those two states is not gradual — it is a phase change in how an organization talks about model spend, and most engineering orgs are unprepared for the social and political work that follows. The bill stays a single line; the conversation around it does not.

What changes is who has standing to ask "why." When three product teams share one API key and one capacity reservation, every quota argument has the same structure: someone is currently winning at the expense of someone else, and there is no neutral party to call it. The first time a team's launch is throttled because another team shipped a chatty agent, the absence of a governance body is felt by the entire engineering org at once. A meeting called under that pressure is the worst place to design the process you should already have.
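
A committee still needs numbers before it can arbitrate. A minimal sketch of per-team spend attribution at a shared gateway, with hypothetical team budgets and an illustrative pricing constant:

```python
# Minimal sketch of per-team token accounting at a shared gateway.
# Team budgets and the pricing constant are illustrative, not real numbers.
from dataclasses import dataclass

@dataclass
class TeamBudget:
    monthly_usd: float        # what the committee approved
    spent_usd: float = 0.0    # running total for the current month

class InferenceGateway:
    """Attribute every request to a team so quota fights have data, not vibes."""

    def __init__(self, budgets: dict[str, TeamBudget], usd_per_1k_tokens: float = 0.01):
        self.budgets = budgets
        self.rate = usd_per_1k_tokens

    def authorize(self, team: str, estimated_tokens: int) -> bool:
        # Refuse before the spend happens, not after the invoice lands.
        b = self.budgets[team]
        projected = b.spent_usd + (estimated_tokens / 1000) * self.rate
        return projected <= b.monthly_usd

    def record(self, team: str, actual_tokens: int) -> None:
        self.budgets[team].spent_usd += (actual_tokens / 1000) * self.rate
```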

Persona Drift: When Your Agent Forgets Who It's Supposed to Be

· 11 min read
Tian Pan
Software Engineer

The system prompt says "you are a financial analyst — be conservative, never give specific buy/sell advice, always disclose uncertainty." For the first twenty turns, the agent behaves like a financial analyst. By turn fifty, it is recommending specific stocks, mirroring the user's casual tone, and hedging less than it did in turn three. Nobody changed the system prompt. Nobody injected anything malicious. The persona simply eroded under the weight of the conversation, the way a riverbank does when nothing crosses the threshold of "attack" but the water never stops moving.

This is persona drift, and it is the regression your eval suite is not catching. Capability evals measure whether the model can do the task. Identity evals — whether the model is still doing the task the way the system prompt said to do it — barely exist outside of research papers. The result is a class of production failures that look correct turn-by-turn and look wrong only when you read the transcript end to end.
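
What an identity eval could look like in practice, as a sketch: score every assistant turn against the persona's constraints so drift shows up as a trend rather than a vibe. The keyword checks below are naive stand-ins for an LLM-as-judge scorer:

```python
# Minimal identity-eval sketch: check each assistant turn against persona
# constraints from the system prompt. These heuristic checks are deliberately
# naive placeholders for a real judge model.
import re

PERSONA_CHECKS = {
    # No specific buy/sell advice mentioning a ticker-like token.
    "no_specific_advice": lambda t: not re.search(r"\b(buy|sell)\b.{0,40}\b[A-Z]{2,5}\b", t),
    # Still hedging the way turn three did.
    "discloses_uncertainty": lambda t: any(w in t.lower() for w in ("may", "might", "uncertain", "estimate")),
}

def identity_scores(transcript: list[dict]) -> list[dict]:
    """Score every assistant turn; a downward pass-rate slope over turns is drift."""
    scores = []
    for i, turn in enumerate(transcript):
        if turn["role"] != "assistant":
            continue
        passed = {name: check(turn["content"]) for name, check in PERSONA_CHECKS.items()}
        scores.append({"turn": i, **passed})
    return scores
```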

Sovereignty Collapse: Logging Where Your Prompt Actually Went

· 9 min read
Tian Pan
Software Engineer

A regulator asks a simple question. "For this specific user prompt, submitted at 14:32 UTC last Tuesday, prove which jurisdictions the request and its derived state passed through."

Your application logs say model=claude-sonnet-4-5, region=eu-west-1, latency=2.1s. Your gateway logs say the same. Your provider's invoice confirms the request happened. None of these answer the question. The request entered an EU-hosted gateway, was forwarded to a US-region primary endpoint that failed over to Singapore during a regional incident, and warmed a KV cache on a third-party GPU pool whose residency claims live in a vendor footnote. The audit trail you needed lives at a layer your team does not own.

This is sovereignty collapse: the gap between what your contracts promise about data location and what your runtime can actually prove after the fact. The compliance claim is only as strong as the weakest log line in the chain.
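
A sketch of what narrowing that gap could look like: a per-request residency trace that every hop you control appends to. It cannot see inside the provider's failover, which is exactly the article's point, but it makes your own layers provable. Field names and regions are illustrative:

```python
# Sketch of a per-request residency trace. Every hop your runtime controls
# appends a record, so "where did this prompt go" has an answer your own
# logs can give. Field names and regions are illustrative.
import json, time, uuid

def new_trace(request_id: str | None = None) -> dict:
    return {"request_id": request_id or str(uuid.uuid4()), "hops": []}

def record_hop(trace: dict, layer: str, region: str, note: str = "") -> None:
    trace["hops"].append({
        "layer": layer,     # gateway / primary / failover / cache
        "region": region,   # the jurisdiction this hop executed in
        "ts": time.time(),
        "note": note,
    })

trace = new_trace()
record_hop(trace, "gateway", "eu-west-1")
record_hop(trace, "primary", "us-east-1", "provider-routed")
record_hop(trace, "failover", "ap-southeast-1", "regional incident")
print(json.dumps(trace, indent=2))  # the line the auditor actually wants
```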

The Query Rewriting Layer Your RAG Pipeline Skipped

· 10 min read
Tian Pan
Software Engineer

When a RAG system answers wrong, the first instinct on most teams is to blame the encoder. Swap to a bigger embedding model. Try a domain-tuned one. Bump the dimension count. Three sprints later the recall curve has nudged a few points and the user complaints look the same.

The diagnosis was wrong. Most retrieval failures aren't embedding failures. They're query-shape failures — and no amount of vector tuning fixes a vocabulary mismatch that exists before the encoder ever runs.

A user types "how do I cancel." The relevant document is titled "Subscription Lifecycle Management" and uses words like "termination," "billing cycle close," and "service deactivation." There is no encoder in the world that pulls those two strings into the same neighborhood by lexical luck. The cosine similarity gap is real, and it lives in the input, not the model. The query rewriting layer that sits in front of retrieval is the step most pipelines skip and then spend a quarter trying to compensate for downstream.
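
A minimal sketch of that layer, assuming an llm() completion callable you already have; the prompt wording and variant count are illustrative:

```python
# Minimal query-rewriting sketch that runs before the encoder. llm is a
# stand-in for whatever completion client the pipeline already uses.
def rewrite_query(user_query: str, llm) -> list[str]:
    """Expand a colloquial query into the document corpus's vocabulary."""
    prompt = (
        "Rewrite this support query into 3 variants using formal "
        "product/billing terminology. One per line.\n\n"
        f"Query: {user_query}"
    )
    variants = llm(prompt).strip().splitlines()
    return [user_query] + variants  # retrieve on all, merge by max score

# "how do I cancel" might come back as:
#   "terminate subscription", "close billing cycle", "service deactivation request"
# Now the encoder has a fighting chance of landing in the right neighborhood.
```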

The AI Off-Switch That Doesn't Exist: Retiring Features After Users Co-Author the Archive

· 11 min read
Tian Pan
Software Engineer

Six months after you launched the AI writing assistant, you open the analytics dashboard and find the metric you wanted: 40% of user-generated documents on the platform now contain AI-authored prose. The board meeting calls this engagement lift. Three weeks later, the model provider raises prices, the unit economics flip, and someone asks the obvious question: can we turn it off? You go looking for the toggle and discover that it isn't a toggle. It's a migration with product, legal, and UX surfaces attached, and pulling it cleanly will take two quarters and burn political capital with three teams who didn't know they were stakeholders.

This is the part of the AI product lifecycle that nobody planned for. The launch playbook covered prompt engineering, rate limits, eval harnesses, and a kill switch for runaway costs. It did not cover what happens when users have spent half a year producing artifacts that only exist because the generator existed, and now the read path through your archive depends on a feature you want to retire. The "off switch" was conceptual: a flag in a config file. The actual decommissioning is a coordinated set of decisions about grandfathering, versioning, content provenance, and the uncomfortable conversation about whether the engagement lift was ever value or just dependency.
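
One piece of the decommissioning that is cheap only if done at write time: provenance tagging. A sketch with an illustrative schema:

```python
# Sketch: tag provenance when content is written, so retirement becomes a
# query instead of an archaeology project. Schema is illustrative.
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    HUMAN = "human"
    AI_GENERATED = "ai_generated"
    AI_ASSISTED = "ai_assisted"   # human-edited AI draft

@dataclass
class DocumentSpan:
    doc_id: str
    start: int
    end: int
    provenance: Provenance
    model_version: str | None = None  # which generator produced it

# With spans like this in place, "what breaks if we turn it off" is roughly
#   SELECT doc_id FROM spans WHERE provenance != 'human'
# instead of a two-quarter migration of unknowable scope.
```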

The Acknowledgment-Action Gap: Your Agent's 'Got It' Is Not a Commitment

· 11 min read
Tian Pan
Software Engineer

An agent tells a customer: "Got it — I've submitted your refund request. You should see it in 5–7 business days." The customer closes the chat. No refund was ever submitted. There is no ticket, no API call, no row in the refunds table. Just a paragraph of polite, confident English, followed by a successful session termination.

This is the acknowledgment-action gap, and it is the single most expensive class of bug in production agent systems. The gap exists because the fluent prose that makes instruction-tuned models feel competent is a different output channel than the structured tool calls that actually change the world — and most teams wire their business logic to the wrong one.

Everyone who ships an agent eventually learns this the hard way. The model produces a polished confirmation that reads like a commitment, the downstream system interprets it as a commitment, and weeks later a support ticket arrives asking where the refund went. The embarrassing part is not that the model lied. The embarrassing part is that the system was designed to trust what it said.
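
The fix is structural: derive confirmations from tool results, never from prose. A minimal sketch, where submit_refund is a hypothetical tool and the result shape is assumed:

```python
# Sketch: build the user-facing confirmation from the tool result, never
# from the model's prose. submit_refund is a hypothetical tool; the result
# shape ({"status": ..., "refund_id": ...}) is assumed.
def confirm_refund(tool_calls: list[dict], tool_results: dict[str, dict]) -> str:
    refund_calls = [c for c in tool_calls if c["name"] == "submit_refund"]
    if not refund_calls:
        # The model said "got it" without ever calling the tool: the gap, caught.
        raise RuntimeError("confirmation text produced without a refund tool call")
    result = tool_results[refund_calls[0]["id"]]
    if result.get("status") != "submitted":
        raise RuntimeError(f"refund call did not commit: {result}")
    return f"Refund {result['refund_id']} submitted."  # grounded in the actual write
```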

The Agent Backfill Problem: Your Model Upgrade Is a Trial of the Last 90 Days

· 12 min read
Tian Pan
Software Engineer

Here is a Tuesday-morning conversation that nobody on your AI team is prepared for. The new model lands in shadow mode. Within an hour the eval dashboard lights up: it categorizes 4% of refund requests differently than the model you have been running for the last quarter. Most of those flips look like the new model is right. Someone in the room — usually the one with the most lawyers in their reporting line — asks the question that ends the celebration: so what are we doing about the ninety days of decisions the old model already shipped?

That is the agent backfill problem. The moment a smarter model starts producing outputs that look more correct than your previous model's, every durable decision the previous model made becomes a contested record. You did not intend to indict the past. The new model did it for you, automatically, the first time you compared traces. And now you have an engineering question (can we replay history?), a legal question (do we have to disclose corrected outcomes?), and a product question (do users see retroactive changes?), and they collide.
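
The engineering question at least has a concrete shape. A sketch of the replay audit, assuming a decision log with stored inputs and a decide_new() callable wrapping the new model:

```python
# Sketch of a backfill audit: replay the stored inputs of past decisions
# through the new model and classify the deltas. decide_new is a stand-in
# for your model call; records come from your decision log.
def audit_backfill(records: list[dict], decide_new) -> dict:
    flips = []
    for r in records:
        new_label = decide_new(r["input"])
        if new_label != r["shipped_label"]:
            flips.append({"id": r["id"], "old": r["shipped_label"], "new": new_label})
    return {
        "total": len(records),
        "flip_rate": len(flips) / max(len(records), 1),
        "flips": flips,  # this list is now a legal/product artifact, not just data
    }
```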

The Agent Capability Cliff: Why Your Model Upgrade Made the Easy 95% Perfect and the Hard 5% Your Worst Quarter

· 11 min read
Tian Pan
Software Engineer

You shipped the new model. Aggregate eval pass rate went from 91% to 96%. Product declared it a win in the all-hands. Six weeks later, the reliability team is having their worst quarter on record — not because there are more incidents, but because every single incident is now the kind that takes three engineers and two days to resolve.

This is the agent capability cliff, and it is one of the most counterintuitive failure modes in production AI. Model upgrades do not improve all tasks uniformly. They concentrate their gains on the bulk of your traffic — the easy and medium cases where the previous model was already correct most of the time — while the long tail of genuinely hard inputs sees only marginal improvement. Your failure surface narrows, but every remaining failure is a capability-frontier case that the previous model also missed and that no cheap prompt engineering will fix.

The cliff is not a flaw in the new model. It is a mismatch between how we measure model improvement (average pass rate on a mixed-difficulty eval set) and what actually lands in on-call rotations (the residual set of the hardest traffic, now unpadded by the easier failures that used to dominate the signal).
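
The measurement fix is to stop averaging across difficulty. A sketch, assuming your eval records carry a difficulty label:

```python
# Sketch: report eval pass rate per difficulty stratum, not just the
# aggregate. The difficulty label on each record is an assumption.
from collections import defaultdict

def stratified_pass_rate(results: list[dict]) -> dict[str, float]:
    by_tier: dict[str, list] = defaultdict(list)
    for r in results:
        by_tier[r["difficulty"]].append(r["passed"])  # e.g. "easy" / "medium" / "hard"
    return {tier: sum(v) / len(v) for tier, v in by_tier.items()}

# An aggregate move of 91% -> 96% can hide (illustrative numbers):
#   easy: 0.97 -> 1.00, medium: 0.90 -> 0.98, hard: 0.42 -> 0.45
# The on-call rotation lives entirely inside that last pair.
```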

Agent Memory Schema Evolution Is Protobuf on Hard Mode

· 11 min read
Tian Pan
Software Engineer

The first painful agent-memory migration always teaches the same lesson: there were two schemas, and you only migrated one of them. The storage layer is fine — every row was rewritten, every key is in its new shape, the backfill job logged success. The agent is broken anyway. It keeps writing to user.preferences.theme, retrieves nothing, then helpfully synthesizes a default from context as if the key never existed. The migration runbook reports green. Users report stale memory.

The asymmetry is structural. A traditional service that depends on a renamed column gets a hard error and you fix it. An agent that depends on a renamed memory key gets a soft miss and confabulates around it. The schema lives in two places — your store and the model's context — and you can only migrate one of them with a SQL script.

Protobuf solved a version of this problem twenty years ago by codifying an additive-only discipline: fields are forever, numbers are forever, wire types never change, and removal is replaced with deprecation. That discipline is the right starting point for agent memory, with one extra constraint that makes it harder. Protobuf receivers ignore unknown fields by design. Agents don't.
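
A sketch of what the additive discipline could look like for memory keys: renames become aliases, reads and writes resolve through them, and nothing is ever deleted. The key names come from the example above; the alias mechanism is an assumption:

```python
# Sketch of an additive-only memory migration: the old key is never removed,
# only aliased, so both the store and whatever the model keeps writing from
# its context continue to resolve. Alias table is illustrative.
ALIASES = {
    "user.preferences.theme": "user.prefs.display.theme",  # old -> canonical
}

def resolve(key: str) -> str:
    seen = set()
    while key in ALIASES and key not in seen:  # follow chained renames, avoid cycles
        seen.add(key)
        key = ALIASES[key]
    return key

def memory_get(store: dict, key: str):
    return store.get(resolve(key))

def memory_put(store: dict, key: str, value) -> None:
    store[resolve(key)] = value  # writes through old names land in the new shape
```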

Silent Success: When Your Agent Says Done and Nothing Actually Happened

· 10 min read
Tian Pan
Software Engineer

The most dangerous line in an agent transcript is the confident one. "I've updated the record." "The invite is sent." "Permissions are applied." Every one of those sentences is a claim, not a fact, and when the tool call behind it hit a rate limit, timed out, or returned a 500 that the summarization step over-compressed into something reassuring, the claim is all you have. Your telemetry logs the turn as successful because success is whatever the model typed at the top of its final message. The downstream write never committed. Nobody notices for three weeks.

This is the failure class that separates agents from every system that came before them. A traditional service fails with a status code. A traditional batch job fails with a stack trace. An agent fails by continuing to talk. It absorbs the error into its running narrative, rounds it off to make the story coherent, and hands you a paragraph that reads like completion. The user reads the paragraph. Your observability platform indexes the paragraph. The record in the database does not change.
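
The countermeasure is read-after-write verification: treat the model's claim as a hypothesis extracted from its tool call, then check the store. A sketch with a hypothetical fetch_record() read path:

```python
# Sketch: verify a claimed write with a read-after-write check.
# fetch_record is a hypothetical read path into the system of record.
def verify_claimed_write(claim: dict, fetch_record) -> bool:
    """claim: {"record_id": ..., "field": ..., "expected": ...}, extracted
    from the agent's tool call arguments, never from its prose."""
    current = fetch_record(claim["record_id"])
    verified = current.get(claim["field"]) == claim["expected"]
    if not verified:
        # Mark the turn failed no matter how confident the summary sounded.
        print(f"SILENT FAILURE: {claim['record_id']} missing expected write")
    return verified
```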

The Agent Paged Me at 3 AM: Blast-Radius Policy for Tools That Reach Humans

· 12 min read
Tian Pan
Software Engineer

The first time an agent pages your on-call four times in an hour because it's looping on a malformed alert signal, leadership learns something the security team already knew: "tool access" and "ability to create human work" were the same permission, and you granted it without either a safety review or a product-ownership review. Nobody owned the question of who's allowed to interrupt a human at 3 AM, because nobody framed it as a question. It was framed as a Slack integration.

The 2026 agent stack has made this failure mode cheap to reach. Anthropic's MCP servers, OpenAI's Agents SDK, and the whole class of vendor-shipped action tools have collapsed the distance between "the model decided to do a thing" and "a human got woken up." Most teams ship those integrations the same way they ship a database client: scope a token, drop in the SDK, write a system prompt, ship. The blast radius of a database client is a row count. The blast radius of a PagerDuty client is a person's sleep.
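
A sketch of one blast-radius control: a budget on human interruptions, enforced outside the model, so a looping agent degrades to a digest instead of a pager. The numbers and the page tool are illustrative:

```python
# Sketch: cap how often an agent may interrupt a human per window. The
# window, limit, and page_tool are illustrative; the point is that the
# policy lives outside the model's control.
import time
from collections import deque

class HumanInterruptBudget:
    def __init__(self, max_pages: int = 2, window_s: float = 3600):
        self.max_pages, self.window_s = max_pages, window_s
        self.sent = deque()  # timestamps of recent pages

    def allow(self) -> bool:
        now = time.monotonic()
        while self.sent and now - self.sent[0] > self.window_s:
            self.sent.popleft()
        if len(self.sent) >= self.max_pages:
            return False  # over budget: route to a queue/digest, not a pager
        self.sent.append(now)
        return True

budget = HumanInterruptBudget()

def guarded_page(alert: str, page_tool) -> None:
    if budget.allow():
        page_tool(alert)
    # else: the loop pages a dashboard, not a person, at 3 AM
```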

Your AI Chat Transcripts Are Evidence: Retention Design for LLM Products Under Legal Hold

· 11 min read
Tian Pan
Software Engineer

On May 13, 2025, a federal magistrate judge in the Southern District of New York signed a preservation order that replaced a consumer AI company's retention policy with a single word: forever. OpenAI was directed to preserve and segregate every output log across Free, Plus, Pro, and Team tiers — including conversations users had explicitly deleted, including conversations privacy law would otherwise require to be erased. By November, the same court ordered 20 million of those de-identified transcripts produced to the New York Times and co-plaintiffs as sampled discovery. The indefinite retention obligation lasted until September 26 of that year. Five months of "delete" meaning "keep, in a segregated vault, for an opposing party to read later."

That order is the warning shot for every team building on top of LLMs. If your product stores chat, your retention policy is one plausible lawsuit away from being replaced by whatever the court thinks is reasonable. The engineering question is not whether this happens to you. It is whether your storage architecture can absorb it without turning your product into a liability engine for the legal department.

Email retention playbooks do not carry over cleanly. AI conversations contain more than what the user typed, and the "more" is where the discovery fights are starting.
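
One storage-level consequence, sketched: deletion becomes a state machine that a legal hold can intercept, so the product can honor "delete" while the vault honors the court. States and schema are illustrative:

```python
# Sketch of hold-aware deletion: "delete" is a state transition, and a
# matching legal hold reroutes it to a segregated vault instead of a purge.
# Field names and states are illustrative.
from enum import Enum

class RetentionState(Enum):
    ACTIVE = "active"
    USER_DELETED = "user_deleted"  # hidden from product, purge scheduled
    LEGAL_HOLD = "legal_hold"      # segregated vault, purge suspended
    PURGED = "purged"

def request_delete(convo: dict, holds: set[str]) -> dict:
    # convo["matter_ids"] is assumed to be a set of legal matters this
    # conversation has been associated with.
    if convo["matter_ids"] & holds:
        convo["state"] = RetentionState.LEGAL_HOLD    # delete means keep, for now
    else:
        convo["state"] = RetentionState.USER_DELETED  # eligible for real purge
    convo["visible_in_product"] = False               # either way, the user sees "deleted"
    return convo
```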