Skip to main content

907 posts tagged with "insider"

View all tags

The Agent Undo Button Is a Saga, Not a Stack

· 10 min read
Tian Pan
Software Engineer

A user clicks "undo" on an agent action that fanned out to twelve tool calls. The agent sent two emails, created a calendar invite, updated a CRM record, charged a card, and posted to a Slack channel. Three of those operations are non-reversible by API. Two are reversible only by an inverse operation that fires its own downstream notification. The remaining seven each have their own definition of idempotency that the planner never reconciled. The undo button you shipped looks reassuring. It quietly succeeds about 60% of the time and silently fails the rest.

This is not a UX bug. It is a saga-pattern problem that distributed-systems engineers have been working on for thirty years, and ignoring that lineage is the most expensive way to discover it.

The Carbon Math of Agent Workflows: A Token Budget Is Now an ESG Disclosure

· 10 min read
Tian Pan
Software Engineer

A stateless chat completion sips electricity. A median Gemini text prompt clocks in at about 0.24 Wh; a short GPT-4o query is around 0.3–0.4 Wh. These numbers are small enough that nobody puts them on a board deck.

An agent task is not a chat completion. A typical "go research this customer and draft a reply" workflow can fan out to 30+ tool calls, 10–15 model invocations, and a context window that grows with every step. The energy cost compounds with the call graph. By the time the agent returns, you have not consumed one unit of inference — you have consumed fifty to two hundred. Suddenly the per-task footprint is in the same order of magnitude as a video stream.

That arithmetic is about to matter outside the engineering org. The EU's CSRD makes Scope 3 emissions disclosure mandatory for in-scope companies, with machine-readable iXBRL reporting required from 2026. The SEC dropped Scope 3 from its final rule, but any multinational with EU operations still has to answer the question. Procurement teams have started adding "what is the carbon footprint per user task of your AI feature?" to vendor questionnaires. Most engineering teams cannot answer it, because nobody instrumented it.

AI Cyber Insurance: The Coverage Gap Your Agent Will Find First

· 11 min read
Tian Pan
Software Engineer

A coding agent merges a change at 2 a.m. that takes a customer's production database offline for ninety minutes. A customer-support agent fans out and sends fourteen thousand misworded refund-denial emails before the loop is killed. An autonomous reconciliation workflow charges 2,800 cards twice. The damages are real, the audit trail names your company, and your finance team files the claim against the cyber policy that was renewed six weeks ago. The carrier's response is a polite letter explaining that the policy covers "unauthorized access by malicious third parties" and "social engineering of an employee" — and the agent was authenticated, the action was authorized, and no employee was deceived. Coverage denied. The loss sits on your balance sheet.

This is not a hypothetical edge case. It is the modal claim profile for the next eighteen months, and the insurance industry knows it. Cyber, E&O, and D&O policy language was calibrated against a threat model where breach severity is a function of records exfiltrated and incident response is a function of forensic hours billed. Agentic AI does not produce that shape of incident. It produces a shape the underwriter has no actuarial baseline for, and the carrier's first instinct — when the actuarial baseline is missing — is to write the exposure out of the policy entirely.

Why Deprecating an AI Feature Is Harder Than You Think: Users Built Trust Scaffolding You Can't See

· 10 min read
Tian Pan
Software Engineer

When OpenAI tried to pull GPT-4o from ChatGPT in August 2025, the backlash was strong enough — organized hashtags, paying users threatening to cancel, public reversal within days — that the company restored it as a default option and promised "substantial notice" before any future removals. The replacement was, by every benchmark the team cared about, better. None of that mattered. Users had spent months learning the model's quirks, calibrating their judgment to its failure modes, and integrating its specific phrasing into workflows the team had never instrumented. Replacing it with "the better version" reset that calibration to zero.

This is the failure mode that the standard deprecation playbook does not cover. Sunsetting a regular SaaS feature — announce, migrate, dark-launch the removal, retire — assumes the user contract is the API surface. For AI features, the contract is the observed behavior of the model: phrasings, tendencies, failure modes, the specific way it handles ambiguity. Users build scaffolding on top of that behavior, and most of the scaffolding lives in their heads, on their laptops, and in downstream systems your team never touches.

The AI Interview Has No Signal: Why Your Loop Doesn't Identify People Who Ship LLM Products

· 10 min read
Tian Pan
Software Engineer

A team I know spent six months running their standard senior-engineer loop with an "AI round" bolted on. They interviewed seventy candidates. They hired three. None of the three shipped an agent that survived a production weekend. The team blamed the talent market. The talent market was fine. The loop was the problem.

The standard engineering interview was calibrated for a stack where correctness is verifiable, performance is measurable on a benchmark, and a good engineer is someone who can decompose a problem into deterministic components and reason about edge cases against a known specification. That stack still exists, and those skills still matter, but the cluster of skills that predicts shipping LLM products is largely orthogonal to it. Your loop is asking the right questions about the wrong job.

This is a structural problem, not a calibration nudge. Adding a forty-five-minute "AI round" to a loop calibrated for deterministic systems doesn't surface AI builders — it surfaces the intersection of classical-systems-strong and LLM-fluent candidates, which is a vanishingly small set, and produces six months of failed loops while everyone wonders where all the AI engineers went.

The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture

· 11 min read
Tian Pan
Software Engineer

The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.

The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.

Cancel-Safe Agents: The Side Effects Your Stop Button Already Shipped

· 11 min read
Tian Pan
Software Engineer

A user clicks Stop because the agent misunderstood the request. The UI flashes "stopped." By the time the spinner disappears, the agent has already sent two emails, scheduled a Tuesday meeting on the user's calendar, opened a draft pull request against the wrong branch, and queued a Slack message that is racing the cancellation signal through the tool layer. The model has obediently stopped generating tokens. The world has not stopped reacting to the tokens it generated thirty seconds ago.

This is the failure mode nobody covered in the agent demo. Cancellation in synchronous code was already a hard problem with a generation of cooperative-cancellation theory behind it: Go contexts, Python's asyncio.cancel, structured concurrency with task groups, the whole grammar of "ask politely, escalate carefully, don't leave resources behind." Agents take that already-hard problem and add a layer on top: the planner does not know that the user revoked authorization between step 4 and step 5, and the tools it kicked off in step 4 do not get a memo when step 5 is cancelled. Stop is a UI affordance. The system underneath stop has to be designed.

Chat History Is a Database. Stop Treating It Like Scrollback.

· 11 min read
Tian Pan
Software Engineer

The most common production complaint about agentic products is some version of "it forgot what we said." The complaint shows up at turn eight, or fifteen, or thirty — never at turn two — and the team's first instinct is always the same: bigger context window. Which is the wrong instinct, because the bug is not in the model. The bug is that the team is treating conversation history as scrollback in a terminal — append a line, render the tail, truncate when full — when what they actually built, without realizing it, is a read-heavy database with append-only writes, a hot working set, an eviction policy hiding inside their truncation rule, and a query pattern that depends on the kind of question being asked. Once you accept that, the entire shape of the problem changes.

The scrollback model is so seductive because the chat UI looks like a transcript. Messages flow downward, the user reads them top-to-bottom, and the natural way to feed the model is to splice the latest N turns into the prompt. The data structure feels free. There's no schema, no index, no query — just append, render, repeat. And for the first few turns, every architecture works. The model has the whole conversation in its context, the bill is small, and the demo is delightful.

The Coding Agent Autonomy Curve: Reading Is Free, Merging Is Incident-Class

· 11 min read
Tian Pan
Software Engineer

The discourse on coding agents keeps collapsing to a binary: autonomous or supervised, YOLO mode or hand-on-the-wheel, --dangerously-skip-permissions or "approve every keystroke." That framing is a category error. A coding agent does not perform "an action." It performs a sequence of actions whose costs span at least seven orders of magnitude — from reading a file (free, undoable, no side effect) to merging to main (irreversible without a revert PR) to rolling out a binary to a fleet (six-figure incident-class). Treating that range with one autonomy switch is like setting a single speed limit for both a parking lot and a freeway.

The team that ships "the agent can do everything" without mapping each action to its blast radius is one prompt-injection-bearing GitHub comment away from a postmortem — and we already have public examples of that exact failure mode. Anthropic's Claude Code Security Review, Google's Gemini CLI Action, and GitHub Copilot Agent were all confirmed in 2026 to be hijackable through specially crafted PR titles and issue bodies, in an attack pattern the researchers named "Comment and Control." The agents weren't broken in some abstract sense. They executed a high-tier action — pushing code, opening a PR — on the basis of a low-trust input the autonomy tier had silently flattened into "all the same."

What follows is the discipline that has to land: a per-action curve, gates that scale with the tier, rollback velocity matched to blast class, and an eval program that tests for tool-composition escalation rather than single-action failure.

Counterfactual Logging: Log Enough Today to Replay Yesterday's Traffic Against Next Year's Model

· 13 min read
Tian Pan
Software Engineer

Every LLM team eventually gets the same email from a director: "Anthropic shipped a new Sonnet. Run our traffic against it and tell me by Friday whether we should switch." The team opens the production trace store, pulls last month's requests, queues them against the new model — and three hours in, somebody asks why the diff scores look insane on tool-using turns. The answer: nobody captured the tool responses in their original form. The traces logged the model's reply faithfully and stored a one-line summary of what each tool returned. Replaying those requests doesn't replay what the old model actually saw; it replays a heavily compressed projection of it. The migration evaluation isn't measuring the new model. It's measuring the new model talking to a different reality.

This is the failure mode I want to talk about. Most production LLM logs are output-shaped: they answer "what did the model say?" reasonably well, and answer "what did the model see?" only sketchily. That asymmetry is invisible until the day you need to replay history against a new model — at which point it becomes the entire story, because the gap between what was logged and what was sent is exactly the gap between a real evaluation and a fake one.

Call it counterfactual logging: capture today the inputs you'd need to ask "what would that other model have done with this exact request?" tomorrow. The bar isn't "we logged the request." The bar is "we can re-execute the request against a different model and trust the result is meaningful."

Your Agent Has Two Release Pipelines, Not One

· 10 min read
Tian Pan
Software Engineer

A team I worked with shipped a "small prompt tweak" on a Wednesday afternoon. The same PR also added one new tool to the agent's registry — a convenience wrapper around an internal admin API that the prompt would now occasionally invoke. The eval suite passed. The canary looked clean. By Thursday morning a customer's billing record had been mutated by an agent acting on a prompt-injected support ticket, the audit trail showed the admin tool firing exactly as designed, and the on-call engineer's first instinct — roll back the prompt — did nothing useful, because the credential had already been used and the row had already been written.

The post-mortem framed it as a security review failure. It wasn't. It was a release-pipeline failure. The team had shipped two completely different asset classes — a behavioral nudge to the model and a new authority granted to the agent — through the same review, the same gate, and the same rollback story, as if they were the same kind of change. They aren't. And once you see them as two pipelines, most "agent governance" debates become much less mysterious.

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.