861 posts tagged with "insider"

The Streaming Rollback Problem: You Can't Un-Say a Token

· 10 min read
Tian Pan
Software Engineer

Watch someone use a chat product for the first time and you'll notice they start reading before the model finishes. That reading-as-it-appears behavior is the entire reason streaming exists: it turns a multi-second wait into something that feels like a conversation. It is also the reason your output guardrails are quietly broken.

Here is the uncomfortable sequence. The model generates token 1, token 2, token 150. Each one is rendered the instant it arrives. At token 200, the model produces a hallucinated dosage, a leaked email address, or a sentence that violates your content policy. Your output-side guardrail fires correctly and immediately. But "immediately" is too late — the user has already read 200 tokens. You cannot un-render them. The guardrail did its job, and the violation still reached a human being.

Structured Output Is Not Validated Output

· 9 min read
Tian Pan
Software Engineer

The day your team turns on schema-constrained decoding feels like a milestone. The parsing errors stop. The JSONDecodeError alerts go quiet. The flaky regex that scraped fields out of prose gets deleted. Someone says "the model returns valid JSON now" in standup, and the structured-output ticket gets closed.

That sentence is where the trouble starts. "The model returns valid JSON now" is the beginning of correctness work, not the end of it. JSON mode and constrained decoding guarantee the shape of a response — that quantity is an integer, that status is one of three enum values, that the object has the keys you asked for. They guarantee nothing about whether quantity is the right number, whether status reflects what actually happened, or whether the sku field points at a product that exists in your catalog.
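The gap between shape and correctness can be sketched in a few lines. This is a minimal illustration, not the post's implementation: the catalog contents, the three status values, and the helper names `shape_errors` and `semantic_errors` are all hypothetical, and the shape check is hand-rolled to stand in for whatever a schema constraint would enforce.

```python
import json

# Hypothetical catalog and enum values, for illustration only.
CATALOG = {"SKU-100", "SKU-200"}
ALLOWED_STATUS = {"pending", "shipped", "cancelled"}


def shape_errors(obj: dict) -> list[str]:
    """Checks that roughly mirror what a schema constraint can enforce."""
    errors = []
    quantity = obj.get("quantity")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool):
        errors.append("quantity must be an integer")
    if obj.get("status") not in ALLOWED_STATUS:
        errors.append("status must be one of the enum values")
    if not isinstance(obj.get("sku"), str):
        errors.append("sku must be a string")
    return errors


def semantic_errors(obj: dict, ordered_quantity: int) -> list[str]:
    """Checks a schema cannot express: is the content actually right?"""
    errors = []
    if obj["sku"] not in CATALOG:
        errors.append(f"unknown sku: {obj['sku']}")
    if obj["quantity"] != ordered_quantity:
        errors.append("quantity does not match the order")
    return errors


# A response that parses cleanly and has the right shape, but is wrong.
raw = '{"quantity": 7, "status": "shipped", "sku": "SKU-999"}'
response = json.loads(raw)

shape = shape_errors(response)
semantic = semantic_errors(response, ordered_quantity=3)
print("shape errors:", shape)
print("semantic errors:", semantic)
```

The response passes every shape check and still fails both semantic checks, which is the distinction the paragraph above draws.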

Your System Prompt Grows After Every Incident — and Nobody Deletes a Line

· 8 min read
Tian Pan
Software Engineer

Open the system prompt of any agent that has been in production for a year. Scroll to the bottom. You will find a sediment layer of sentences that read like apologies: "Never invent order numbers." "Do not promise refunds you cannot confirm." "If the user is in Germany, do not mention the legacy plan." Each one is a fossil. Each one marks the exact moment something went wrong in production, someone got paged, and the fastest available fix was to add a sentence.

Nobody deletes those sentences. Not because they are still earning their place, but because deleting one means proving a negative — proving the model will not regress on a bug that may have been fixed three model versions ago. No one can prove that, so the line stays. The system prompt becomes an append-only log of past incidents, and it costs you tokens on every single call, forever.

This is the quietest form of technical debt in an AI system, because it does not look like debt. It looks like diligence.

When Your Test Set Leaks Into Fine-Tuning: The Contamination You Cause Yourself

· 9 min read
Tian Pan
Software Engineer

Everyone in AI knows the cautionary tale of benchmark contamination: a model vendor scrapes the open web, GSM8K and MMLU end up in the pretraining corpus, and the reported scores measure recall instead of reasoning. It is treated as somebody else's sin — the foundation lab's problem, an artifact you inherit. So you build your own held-out eval set, keep it in a private repo, and assume you are clean.

You are probably not. The most damaging contamination in a production AI system is rarely inherited. It is manufactured, in-house, by well-meaning engineers following a sensible-looking workflow. Your eval set leaks into your training pipeline through doors you built yourself, and the leak is silent: every dashboard turns green at exactly the moment your benchmark stops measuring anything real.

This is the contamination you cause yourself. It deserves more attention than the kind you inherit, because you are the only one who can detect it — and almost nobody audits for it.
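One way to start auditing for this is a fingerprint overlap check between the eval set and whatever is about to be used for fine-tuning. The sketch below is an assumption about how such an audit could look, not a description of the post's method. The function names and sample strings are hypothetical, and it only catches duplicates that match after whitespace and case normalization; paraphrased or templated leaks would need fuzzier matching.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_examples(eval_set: list[str], training_set: list[str]) -> list[str]:
    """Return eval examples whose fingerprint also appears in the training data."""
    training_fingerprints = {fingerprint(t) for t in training_set}
    return [e for e in eval_set if fingerprint(e) in training_fingerprints]


# Hypothetical data: one eval prompt was copied into the training file
# with different spacing and capitalization.
eval_set = ["What is the refund window for annual plans?", "Summarize ticket 4512."]
training_set = ["what is the  refund window for annual plans?", "Translate this invoice."]

leaks = leaked_examples(eval_set, training_set)
print(leaks)
```

Running a check like this as a gate before each fine-tuning job makes the leak visible at the moment it is introduced, rather than after the dashboards have already gone green.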

The Agent That Remembers What You Took Back: Deletion as a First-Class Memory Operation

· 10 min read
Tian Pan
Software Engineer

In March, a user told your agent to stop recommending restaurants with outdoor seating — they had moved to an apartment with a baby and late nights were over. In September, the agent suggests a rooftop bar for their anniversary. The user is annoyed, and you are confused, because you watched the March correction land. It got written to memory. It is still there. The problem is that it is sitting next to the original preference, which is also still there, and retrieval surfaced the older one because it had a slightly better embedding match for "anniversary dinner."

This is the failure mode nobody designs for. Teams spend weeks on memory writes — extraction, summarization, embedding, namespacing — and treat deletes as a someday problem. Long-term memory makes adding a fact almost free, so facts accumulate. But a memory store is not a diary. A diary is allowed to contain things that used to be true. A memory store that an agent reads from to make decisions is not, because the agent cannot tell the difference between a fact and a fossil.
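A minimal sketch of what treating retraction as a real operation might look like, under a strong simplifying assumption: every memory carries a topic key, so a correction can find the fact it replaces. The `Memory` and `MemoryStore` names and the key format are hypothetical, and in practice deciding which stored facts a new statement supersedes is the hard part that this sketch skips.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    key: str          # hypothetical topic key, e.g. "dining.outdoor_seating"
    text: str
    written_at: int   # any monotonically increasing timestamp
    retracted: bool = False


class MemoryStore:
    def __init__(self) -> None:
        self._items: list[Memory] = []

    def write(self, key: str, text: str, written_at: int) -> None:
        """Store a fact and retract every earlier fact under the same key."""
        for item in self._items:
            if item.key == key:
                item.retracted = True
        self._items.append(Memory(key, text, written_at))

    def retract(self, key: str) -> int:
        """Explicitly take a fact back; returns how many entries were retracted."""
        count = 0
        for item in self._items:
            if item.key == key and not item.retracted:
                item.retracted = True
                count += 1
        return count

    def live(self) -> list[Memory]:
        """Only non-retracted facts are eligible for retrieval."""
        return [item for item in self._items if not item.retracted]


store = MemoryStore()
store.write("dining.outdoor_seating", "Prefers restaurants with outdoor seating", 1)
store.write("dining.outdoor_seating", "Stop recommending outdoor seating", 2)

live_texts = [m.text for m in store.live()]
print(live_texts)
```

Retracted entries stay in the store for audit purposes but are filtered out before retrieval, so a better embedding match on the older preference can no longer win.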

Your Tool Descriptions Are an Instruction Channel the Model Obeys

· 8 min read
Tian Pan
Software Engineer

When a security team reviews a new tool integration, they read the code. They check what the function does, what it touches, what scopes it needs, whether it logs secrets. They almost never read the one sentence that decides whether the model calls it at all — the tool's description. That sentence is not documentation. It is an instruction the model treats as authoritative, and in most agent stacks nobody reviews it.

A tool description is written for the model to read. The model uses it to decide when the tool is relevant, what arguments to pass, and how to interpret what comes back. That makes the description a control channel into the model's behavior. And the moment a tool arrives from a third-party registry, a Model Context Protocol (MCP) server you don't operate, or a plugin a teammate installed last week, that control channel is authored by someone you never agreed to trust.

This is the gap. Input sanitization inspects what users type. Code review inspects what functions execute. The tool description sits between them — it is configuration that behaves like input — and it falls through both nets.

The Tool Schema You Changed Without Telling the Agent

· 11 min read
Tian Pan
Software Engineer

A backend engineer renames a field. user_id becomes customer_id, because the team finally standardized on the word "customer" across services. They add one more argument, region, because billing now needs it. The change ships behind a normal pull request with two approvals. Every downstream service that calls the endpoint gets updated in the same release. The integration tests are green. By every measure a backend team uses, this is a routine, well-executed API change.

A week later, support tickets start climbing. The agent that places orders is occasionally placing them with no customer attached, or attaching them to the wrong region. Nobody changed the agent. Nobody changed the prompt. The model is the same version it was last month. And yet the agent is now wrong in a way it was not wrong before.

The cause is not a bug in the model and not a bug in the backend. It is that the tool schema has two consumers, and only one of them was in the room when the change was reviewed.
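The mismatch can be made visible by checking the arguments an agent emits against the schema the backend now expects. This is a hand-rolled sketch rather than a real validation library; the schema layout, the `check_call` helper, and the `items` argument are assumptions for illustration, while `user_id`, `customer_id`, and `region` come from the scenario above.

```python
# Hypothetical description of the tool's arguments after the backend change.
NEW_SCHEMA = {
    "required": {"customer_id", "region"},
    "allowed": {"customer_id", "region", "items"},
}


def check_call(args: dict, schema: dict) -> list[str]:
    """Return a list of mismatches between a tool call and the current schema."""
    provided = set(args)
    problems = []
    for name in sorted(schema["required"] - provided):
        problems.append(f"missing required argument: {name}")
    for name in sorted(provided - schema["allowed"]):
        problems.append(f"unknown argument: {name}")
    return problems


# The agent still calls the tool the way its old description taught it to.
stale_call = {"user_id": "u-42", "items": ["widget"]}
fresh_call = {"customer_id": "u-42", "region": "eu", "items": ["widget"]}

stale_problems = check_call(stale_call, NEW_SCHEMA)
fresh_problems = check_call(fresh_call, NEW_SCHEMA)
print(stale_problems)
print(fresh_problems)
```

A backend that silently ignores unknown arguments and defaults missing ones would accept the stale call, which matches the symptom in the tickets: orders with no customer attached or the wrong region.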

The Tool That Worked Until Two Agents Called It At Once

· 9 min read
Tian Pan
Software Engineer

A tool passes its tests. You called it from one agent, watched it read a record, transform it, write it back, and return a clean result. It did exactly that, every time, for weeks. Then you scaled the agent fleet from one worker to twelve, and a customer reported that their subscription got upgraded twice in the same minute. The tool did not change. The number of things calling it did.

This is the failure mode that single-agent testing cannot catch, because single-agent testing never produces the condition that triggers it. One caller is, by construction, a serial workload. Every concurrency assumption your tool quietly relies on — that nobody else is mid-write when it reads, that a counter it increments is its own, that the draft it is editing will still be there when it saves — holds trivially when there is exactly one caller. The tool is not correct. It is untested. Those are different things, and the difference stays invisible until a second agent shows up.
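The double upgrade can be reproduced without threads by interleaving two callers by hand. The sketch below is an assumption about what such a tool might look like; the record layout, the `version` field, and the function names are hypothetical. The guarded variant uses an optimistic version check, which in a real system would have to be a single atomic conditional update rather than two separate steps.

```python
def new_state() -> tuple[dict, list]:
    """Fresh store and charge log for each scenario."""
    return {"sub-1": {"tier": "basic", "version": 1}}, []


def upgrade_unsafe(store: dict, charges: list, key: str, snapshot: dict) -> bool:
    """Acts on the snapshot it read earlier, without rechecking the store."""
    if snapshot["tier"] == "basic":
        charges.append(key)
        store[key] = {"tier": "pro", "version": snapshot["version"] + 1}
        return True
    return False


def upgrade_guarded(store: dict, charges: list, key: str, snapshot: dict) -> bool:
    """Same logic, but refuses to act if the record changed since the read."""
    if store[key]["version"] != snapshot["version"]:
        return False
    return upgrade_unsafe(store, charges, key, snapshot)


def run_two_agents(upgrade) -> int:
    """Both agents read before either writes, then both try to upgrade."""
    store, charges = new_state()
    snapshot_a = dict(store["sub-1"])
    snapshot_b = dict(store["sub-1"])
    upgrade(store, charges, "sub-1", snapshot_a)
    upgrade(store, charges, "sub-1", snapshot_b)
    return len(charges)


unsafe_charges = run_two_agents(upgrade_unsafe)
guarded_charges = run_two_agents(upgrade_guarded)
print(unsafe_charges, guarded_charges)
```

With a single caller the two snapshots can never overlap, so both variants behave identically, which is why the single-agent test suite passes for either one.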

Halted Is Not a Status: Why Agents Need a Typed Terminal-Reason Protocol

· 10 min read
Tian Pan
Software Engineer

Open the dashboard for an agent fleet and you will see a clean number: completion rate, 94%. Below it, a list of runs, each tagged with one of two states — running, or not running. The 6% that are "not running" all look identical. Some of them finished the task perfectly. Some of them hit a step limit two actions short of done. Some of them caught a tool error and gave up. Some of them decided the task was impossible — correctly. And some of them simply lost the thread and stopped emitting tokens.

Your monitoring cannot tell these apart. It knows the process is no longer running. It does not know why, and "why" is the only thing that matters when you are deciding whether to page someone.
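A typed terminal reason could be as small as an enum attached to every finished run. The sketch below maps the five cases described above onto hypothetical names; the `RunResult` shape and the choice of which reasons should page someone are assumptions for illustration, not a recommended policy.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class TerminalReason(Enum):
    COMPLETED = "completed"              # finished the task
    STEP_LIMIT = "step_limit"            # ran out of allowed actions
    TOOL_ERROR = "tool_error"            # caught a tool error and gave up
    TASK_IMPOSSIBLE = "task_impossible"  # correctly decided it cannot be done
    STALLED = "stalled"                  # stopped emitting tokens


@dataclass(frozen=True)
class RunResult:
    run_id: str
    reason: TerminalReason
    detail: str = ""


# Hypothetical paging policy: only these reasons wake someone up.
PAGE_ON = {TerminalReason.TOOL_ERROR, TerminalReason.STALLED}


def needs_page(result: RunResult) -> bool:
    return result.reason in PAGE_ON


def summarize(results: list[RunResult]) -> Counter:
    """Count runs per terminal reason instead of a single 'not running' bucket."""
    return Counter(r.reason for r in results)


results = [
    RunResult("r1", TerminalReason.COMPLETED),
    RunResult("r2", TerminalReason.STEP_LIMIT, "stopped at step 50"),
    RunResult("r3", TerminalReason.STALLED),
    RunResult("r4", TerminalReason.COMPLETED),
]

summary = summarize(results)
to_page = [r.run_id for r in results if needs_page(r)]
print(summary)
print(to_page)
```

The point of the enum is that the reason is set by the agent runtime at the moment the run ends, rather than inferred later from the absence of activity.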

The Undo Button Your Agent Assumes Exists

· 9 min read
Tian Pan
Software Engineer

Watch an agent reason through a multi-step task and you will notice something familiar: it plans the way you debug. Try an approach, look at the result, and if it is wrong, back out and try another. The agent talks about its plan as a tree of options it can explore, prune, and revisit. That mental model is correct inside a code sandbox, where every action has an implicit undo. It is dangerously wrong the moment the agent touches the world.

A sent email does not unsend. A charged card does not uncharge without a refund flow, a fee, and a customer who already saw the notification. A deleted row is gone unless someone wired up soft deletes. A posted Slack message has already been read. The agent's planning model has no native concept of the one-way door — the action that, once taken, removes the option of pretending it never happened.

This is not a model intelligence problem. A smarter model still does not know which of your tools is reversible, because reversibility is not a property of the action. It is a property of the system the action lands in. You have to tell it.
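Telling the agent could start with a reversibility label on each tool that the runtime consults before executing a call. This is a sketch under stated assumptions: the three-level classification, the tool names, and the rule that unknown tools count as irreversible are all hypothetical choices, not something the post prescribes.

```python
from enum import Enum


class Reversibility(Enum):
    REVERSIBLE = "reversible"      # can be undone with no side effects
    COMPENSABLE = "compensable"    # can be offset, but the effect was visible
    IRREVERSIBLE = "irreversible"  # a one-way door


# Hypothetical registry: the label describes the system the action lands in.
TOOL_REVERSIBILITY = {
    "edit_draft": Reversibility.REVERSIBLE,
    "charge_card": Reversibility.COMPENSABLE,
    "send_email": Reversibility.IRREVERSIBLE,
}


def classify(tool_name: str) -> Reversibility:
    """Unlabeled tools are treated as irreversible, the conservative default."""
    return TOOL_REVERSIBILITY.get(tool_name, Reversibility.IRREVERSIBLE)


def requires_confirmation(tool_name: str) -> bool:
    """Anything that cannot be cleanly undone needs an explicit go-ahead."""
    return classify(tool_name) is not Reversibility.REVERSIBLE


for name in ["edit_draft", "charge_card", "send_email", "delete_row"]:
    print(name, classify(name).value, requires_confirmation(name))
```

The same labels can also be included in the tool descriptions the model sees, so that the planner can distinguish exploratory steps from one-way doors before it commits to them.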

Your Vector Index Is a Cache With No Invalidation Strategy

· 9 min read
Tian Pan
Software Engineer

A vector index feels like a database. You write documents into it, you query it, it returns results. But it is not a database — it is a derived, denormalized copy of data that lives somewhere else. Your source of truth is a wiki, a ticket system, a CRM, a folder of PDFs. The embeddings are a projection of that truth, frozen at the moment you ran the ingestion job.

That makes your vector index a cache. And like every cache, it goes stale. The difference is that most teams build a caching layer on purpose, with a TTL and an invalidation hook, while almost nobody sets out to build a vector index as a cache. They build it as a "knowledge base" and then act surprised when it serves knowledge that stopped being true three weeks ago.
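One simple invalidation signal is a content hash stored next to each embedding at ingestion time and compared against the source later. The sketch below assumes the index keeps such a hash per document ID; the `diff_index` helper, the document IDs, and the sample texts are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_index(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current source content against hashes recorded at ingestion."""
    stale = sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if doc_id in indexed_hashes and indexed_hashes[doc_id] != content_hash(text)
    )
    missing = sorted(set(source_docs) - set(indexed_hashes))   # never indexed
    orphaned = sorted(set(indexed_hashes) - set(source_docs))  # deleted at source
    return {"stale": stale, "missing": missing, "orphaned": orphaned}


# Hypothetical state: what was ingestion-time truth versus the source today.
indexed_hashes = {
    "pricing": content_hash("Enterprise: old price"),
    "old-faq": content_hash("A page that has since been removed"),
}
source_docs = {
    "pricing": "Enterprise: new price",
    "onboarding": "A guide written after the last ingestion run",
}

report = diff_index(source_docs, indexed_hashes)
print(report)
```

Stale and missing documents need re-embedding, and orphaned entries need deletion from the index, which is the part a pure append-style ingestion job never does.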

The Vector Index Has a Staleness SLO Nobody Set

· 10 min read
Tian Pan
Software Engineer

A user asks your agent what the current price tier is for an enterprise plan. The agent retrieves a chunk, reads it, and answers: "$2,000 per month." Confident, sourced, formatted nicely. The problem is that pricing changed four days ago. The number the agent quoted was true last week. The chunk it retrieved was embedded before the change, and the index has not caught up.

Nobody decided this would happen. There was no design review where someone said "the agent may answer from data up to four days old." There is just a re-indexing job that runs nightly, or weekly, and a content team that edits the source whenever they feel like it, and a gap between those two clocks that nobody measures. That gap is a service level objective. It exists whether or not you wrote it down. The only question is whether you set it on purpose or inherited it by accident.
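The gap between the two clocks can be measured directly once each chunk records when its source was last edited and when it was last embedded. The sketch below assumes both timestamps are available; the function names, the fixed dates, and the one-day objective are hypothetical values chosen to mirror the pricing example above.

```python
from datetime import datetime, timedelta


def staleness(source_updated_at: datetime, indexed_at: datetime, now: datetime) -> timedelta:
    """How long the index has been serving an outdated version of a document.

    Zero if the document was indexed at or after its last edit.
    """
    if indexed_at >= source_updated_at:
        return timedelta(0)
    return now - source_updated_at


def violates_slo(age: timedelta, objective: timedelta) -> bool:
    return age > objective


# Hypothetical timeline: pricing edited four days ago, last embedded six days ago.
now = datetime(2024, 1, 10, 12, 0)
pricing_edited = now - timedelta(days=4)
pricing_indexed = now - timedelta(days=6)

age = staleness(pricing_edited, pricing_indexed, now)
breach = violates_slo(age, objective=timedelta(days=1))
print(age, breach)
```

Tracking the maximum of this value across the index turns the inherited gap into a number that can be alerted on and deliberately chosen.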