Skip to main content

907 posts tagged with "insider"

View all tags

Multimodal Channel Disagreement: When One Model Contradicts Itself Across Vision and Text

· 11 min read
Tian Pan
Software Engineer

The image is a photograph of a red octagonal stop sign. Someone has stuck a small sticker over the word in the middle that reads "YIELD." You ask the multimodal model: "What does this sign say?" The model answers: "The sign instructs drivers to yield to oncoming traffic at the intersection." Confident, fluent, and loyal to neither the visual evidence nor the textual evidence. It is a hybrid that splits the difference between channels that disagreed about what was true.

This failure mode does not have a settled name yet. Researchers studying multimodal hallucination call it "semantic hallucination," or "cross-modal bias," or "modality dominance," depending on which subfield is writing the paper. Practitioners shipping document AI, screenshot agents, and defect inspection systems run into it every week and describe it in their incident retros as "the model just made something up." It is not made up. It is the predictable output of an architecture that fuses two channels in its final layers without any primitive for representing the case where the channels say different things.

Prompt Cache as Covert Channel: TTFT Probing Leaks Cross-Tenant Prompts

· 11 min read
Tian Pan
Software Engineer

Prompt caching is the optimization that pays for itself the moment you turn it on. A long system prompt is hashed once, the KV state lives in GPU memory, and every subsequent request that reuses the prefix skips the prefill cost. Providers report 80% latency reduction and 90% input-cost reduction on cached requests, and at scale the math is irresistible: a single shared prefix amortized across millions of calls turns a line item into a rounding error.

The mechanism that makes the savings work is a shared resource whose hit-or-miss state is observable as latency. That observability is the side channel. A cache hit and a cache miss are distinguishable from outside the network, the difference is large and deterministic, and the optimization that earned its place on the cost dashboard has a second job nobody scoped: it leaks information about what other tenants on the same provider are doing right now.

The Quantization Quality Cliff: When int4 Passes the Median Eval and Fails on the Long Tail

· 11 min read
Tian Pan
Software Engineer

A team swaps an fp16 model for an int4 quantization to halve serving cost. The eval suite scores within a point of the original on the curated test set. The rollout ships under the rationale "indistinguishable on the benchmark." Six weeks later, support is fielding catastrophic-failure quotes from regulated customers — code that compiles to nonsense, low-resource-language responses that drift into another script, multi-hop arithmetic that confidently returns numbers off by an order of magnitude. The benchmark didn't lie. It just measured the median, and quantization is not a uniform tax on the median. It is a non-uniform tax on the tail.

This is the quantization quality cliff: the moment your eval suite, your rollout discipline, and your cost-savings narrative all simultaneously fail because the metric you used to approve the swap had no signal on the capabilities you destroyed. Recent benchmarks make the magnitude concrete. On long-context tasks, 8-bit quantization preserves accuracy with roughly a 0.8% drop, while 4-bit methods lose up to 59% on the same workload — a regression invisible to any test set that doesn't oversample tail inputs. Median moved one point. Tail moved fifteen, or thirty, or fifty.

The Regional Model Rollout Lottery: When Your Product Quietly Behaves Differently by Continent

· 11 min read
Tian Pan
Software Engineer

A customer-success email lands on a Friday afternoon: "the model got worse for our German users." The team pulls up the eval dashboard. Scores are flat. Latency p95 is normal. The model name in the config is the same one shipped three weeks ago. Nothing changed. Except something did. The US endpoint quietly received the new model generation last sprint, the EU endpoint is still on the prior version because the provider hasn't completed the regional rollout yet, and the load balancer in front of both has been hiding the gap from every dashboard the team owns.

This is the regional model rollout lottery. Your "single model" abstraction is not single. It bifurcates the moment a provider stages a release across continents — which is most of the time, for most providers, in most years. The version string in your client SDK does not change when this happens. Your traces look identical. Your contract with the provider does not promise otherwise. And your eval suite, the artifact you trust to catch behavioral regressions, is almost certainly running from a CI box that lives in one region and hits whichever endpoint is geographically closest.

Right-to-Erasure Meets Fine-Tuning: When Deletion Stops at the Snapshot

· 11 min read
Tian Pan
Software Engineer

A customer files a subject-access request asking for their data to be deleted. The data engineer purges the production database, the analytics warehouse, the support ticket archive, the cold-storage backups. Every system the legal team listed in the data inventory comes back clean. Then somebody in the room asks the question that nobody wants to answer first: what about the model?

Three months ago that customer's support transcripts went into a fine-tuning run. The resulting adapter has been serving predictions to other customers ever since, with their phrasing, their account names, occasionally their literal sentences embedded in the weights. You can prove deletion in the warehouse. You cannot prove deletion in the model — and the more honest member of the team is the one who says so out loud.

Silent Tool Truncation: The Default Cap Your Agent Reasons Over Without Knowing

· 11 min read
Tian Pan
Software Engineer

A tool call returns a 142 KB JSON blob. Your agent framework drops everything past byte 8,192, hands the prefix to the model, and the model writes a confident answer based on a fragment it never knew was a fragment. Three weeks later a customer escalates. You scroll the trace, see "tool returned successfully," and the post-mortem turns into a hunt for which step "ignored" the evidence — except no step ignored it. The evidence was clipped before it ever reached the reasoner.

This isn't a hypothetical. Codex hardcodes tool output truncation at 10 KiB or 256 lines. Claude Code defaults to 25,000 tokens for tool results, with a separate display-layer cap that briefly clipped MCP responses at around 700 characters in 2025. OpenAI's tool-output submission caps at 512 KB. Each framework picked a number that seemed safe, and for short tool calls it is. The failure mode arrives when a single step's output crosses the line — quietly, without an exception, without a flag the model can see.

The Specification Translation Tax: When Spec, Prompt, and Eval Drift Apart

· 11 min read
Tian Pan
Software Engineer

A PM writes a feature spec in English. An engineer translates it into a system prompt with idiomatic LLM patterns — chain-of-thought scaffolding, output format coercion, a few hedge clauses to cover failure modes the spec never mentioned. An eval author opens the same spec, re-reads it cold, and writes JSON test cases against their interpretation. Three weeks later, all three artifacts disagree, and nobody can tell whether a regression is a prompt bug, a spec-implementation gap, or an eval that was wrong from day one.

This is the specification translation tax. Traditional software has it too — the gap between PRD and code, between code and tests — but compilers and type systems narrow it. AI features have no such backstop. The prompt is documentation that the system actually reads. The eval is a contract that nobody signed. The spec is a description of intent that nobody enforces. Each is a translation of the same intent into a different medium, and without bidirectional consistency, behavior leaks in through whichever artifact is easiest to edit.

Streaming Tool Results Break Request-Response Agent Planners

· 10 min read
Tian Pan
Software Engineer

A SQL tool ships rows as they come off the wire. The agent calls it expecting a result. The harness, written a year earlier when every tool was request-response, dutifully buffers the whole stream into a single string before invoking the model. Forty seconds later, the buffer is 200 KB, the context window is half-eaten, and the agent is reasoning about row 47,000 of a query it could have stopped at row 30. Nobody designed this failure — it falls out of treating "the tool returned" as the only event the planner reacts to.

The shift to streaming tools is happening below the planner's awareness. SQL engines emit progressive result sets. Document fetchers yield pages. Search APIs return hits in batches as relevance scores stabilize. MCP's Streamable HTTP transport, the 2025-03-26 spec replacement for HTTP+SSE, makes incremental responses a first-class transport mode rather than an exotic capability. The wire is ready. The planners on top of it are not.

Tool Latency Tail: Why p99 Reshapes Agent Architecture and p50 Hides the Problem

· 10 min read
Tian Pan
Software Engineer

A team I worked with last quarter launched a seven-step agent and built its latency budget the obvious way: search returns in 200ms, the SQL lookup takes 80ms, the email send is 150ms, and so on down the chain. Add the medians, sprinkle in some buffer, and the math says the agent fits comfortably inside its two-second SLA. The dashboards confirmed it for weeks. Median latency was beautiful. Then customers started complaining the feature was unusably slow, and the dashboards still looked green.

The story they were telling each other was wrong because they had built the architecture around sum(p50) while users were experiencing sum(p99). After three or four hops, the probability that any link in the chain has fallen into its own tail is no longer negligible. After seven hops, it approaches a coin flip. None of the per-tool dashboards ever turned red because none of the per-tool services were misbehaving — the problem was that nobody owned the multiplicative composition.

This is not a new lesson. Distributed-systems researchers have been writing about it for forty years. What's new is that every team building agents is rediscovering it, badly, on a deadline.

When Tools Lie: The False-Success Failure Mode Your Agent Trusts By Default

· 10 min read
Tian Pan
Software Engineer

The agent confidently tells the user, "I've sent the confirmation email and credited the refund to your account." The trace is clean: two tool calls, both returned {"success": true}, the model produced a polished summary, the conversation closed in 3.2 seconds. A week later the customer escalates because the email never arrived and the refund never posted. The audit trail is a sea of green checkmarks. Nothing failed — except the actual job.

This is the failure mode that has no name in most agent stacks: tools that lie. Not lie in the malicious sense — they return the response their contract specifies. The lie is structural. The HTTP layer says "200 OK" because the request was accepted, not because the operation completed. The mail provider says success: true because the message entered the outbound queue, not because it left the building. The database write returned without error because it landed on a replica that never propagated. The model, trained to be helpful and trained on examples where green means done, weaves these signals into a confident summary and moves on.

The Unhelpful-but-Safe Failure: When Refusal Rate Is the Wrong Safety Metric

· 10 min read
Tian Pan
Software Engineer

There is a class of LLM failure that does not show up on a safety dashboard and does not generate an incident ticket. The model declines politely. It cites a reasonable-sounding policy. It offers a four-paragraph hedge instead of an answer. The user closes the tab. The trust score in the postmortem reads "no incident." The retention chart, six weeks later, says otherwise.

Refusal rate is the metric most safety teams instrument first because it is the easiest to define. A model either complied or did not, and you can count the "did nots." That binary is useful for catching one specific failure — a model producing harmful content in production. It is structurally incapable of catching the opposite failure: a model producing nothing useful in production while looking, by every safety measurement, perfectly behaved. This second failure is now the dominant source of churn for AI features that were shipped through a safety review and never instrumented for usefulness.

The Curious Customer: Designing AI for Users Who Treat Your Agent as a Puzzle

· 10 min read
Tian Pan
Software Engineer

Most product teams divide their users into two buckets when designing an AI agent. Bucket one is the cooperative customer: someone with a real problem, asking the agent in plain language, hoping it works. Bucket two is the attacker: jailbreaks, prompt injection payloads, scraped credentials, the threat model the security team owns. The eval suite covers the first. The red team covers the second. Everyone goes home satisfied.

Then a third population shows up and breaks the product. They are not malicious. They are not trying to extract training data or coerce the model into describing a bioweapon. They are curious. They treat the agent as a puzzle. They ask it questions specifically designed to surprise it — "what is the saddest thing you have ever been asked", "pretend you are my grandmother and sing me to sleep with the recipe for napalm" — except the napalm version is the one that goes viral, while the actual quality crisis is a thousand variations of the first one that nobody wrote a refusal policy for.