Skip to main content

861 posts tagged with "insider"

View all tags

The Agent Trace That's Too Big to Debug: When You Logged Everything and Can Read None of It

· 11 min read
Tian Pan
Software Engineer

The standard advice for agent observability is three words long: log the full trace. Capture every tool call, every prompt, every model response, every memory read and write. Teams comply. Then the first real incident arrives, an engineer opens the trace, and discovers it is forty tool calls deep and two hundred thousand tokens wide. The trace is technically complete. It is also practically unreadable.

What follows is a familiar ritual. The engineer scrolls. They expand a span, see fifty thousand characters of JSON, collapse it, scroll again. Ten minutes in, they find the one model turn where the agent picked the wrong tool — buried between thirty-seven turns that did exactly what they were supposed to. The trace that was supposed to make the failure legible instead made it expensive to investigate.

Why You Can't Budget an AI Feature With a Single Number

· 9 min read
Tian Pan
Software Engineer

Finance asks one question about every feature you ship: "What does it cost per user?" For a traditional feature, the answer is a number. A page render, a database query, a push notification — each has a marginal cost that barely moves from one request to the next. You measure it once, multiply by your user count, and the forecast holds.

An AI feature breaks that contract. Ask "what does this agent cost per request" and the honest answer is not a number, it's a histogram. The same agent that resolves one ticket for two cents will burn four dollars on the next one, because that user asked a vague question, the agent looped through eleven tool calls, and each call dragged the entire growing conversation back through the model. The mean of those two requests — two dollars — describes neither of them, and it definitely doesn't describe the bill.

That is the trap. When you hand finance a single average cost, you are not simplifying a messy reality. You are reporting a number that is wrong in a specific, expensive direction.

The Approval Queue Nobody Drains

· 10 min read
Tian Pan
Software Engineer

You did the responsible thing. You looked at your agent, identified the actions that could cause real damage — issuing a refund, deleting a record, sending an external email, deploying a config change — and you routed them to a human for approval. Risk-tiered gating. Textbook. The review board signed off.

Then a customer escalation came in three weeks later: an agent task had been "in progress" since the previous Tuesday. Not failed. Not errored. Just sitting in a human approval queue that, it turned out, nobody was actually watching. The agent had done its job, parked the dangerous action behind a gate, and waited. The gate had no owner. The task aged silently in a place where no dashboard pointed and no alarm fired.

The Coding Interview That Agents Quietly Invalidated

· 10 min read
Tian Pan
Software Engineer

A two-hour take-home and a 45-minute algorithm round were never the point. They were proxies. The take-home stood in for "can this person ship a feature," and the whiteboard round stood in for "can this person decompose a problem under pressure." For two decades those proxies held up well enough that most teams stopped questioning them. They were cheap to administer, easy to grade, and roughly correlated with the thing you actually cared about.

Coding agents broke the correlation without breaking the format. The interview still runs. It still produces a score. The score still feels like signal. But the gap between what the interview measures and what the job requires has widened to the point where a green result certifies almost nothing — and most hiring pipelines have not noticed, because nothing visibly failed.

This is the quiet kind of invalidation. Not a process that collapsed, but a process that kept running after its assumptions stopped being true.

What a Coding Interview Measures When the Candidate Has an Agent

· 9 min read
Tian Pan
Software Engineer

The coding interview was built to isolate a single variable. Put a person in a room, give them a problem, take away their references, and watch whether they can turn the problem into working code by themselves. Everything about the format — the whiteboard, the blank editor, the prohibition on looking things up — exists to strip away collaborators and tools so you measure one isolated skill: can this person, alone, write correct code under pressure.

That skill is no longer the one the job exercises. Day-to-day engineering in 2026 is a collaboration between an engineer and an agent. The engineer decides what to build, the agent drafts most of the code, and the engineer's real work is reviewing, correcting, and deciding when the agent is confidently wrong. The interview measures solo code production. The job rewards directing a tireless, fast, occasionally hallucinating collaborator. The proxy and the target have come apart, and most hiring pipelines haven't noticed.

This is not a complaint about cheating, though cheating is the symptom everyone fixates on. It's a measurement problem. When you can no longer observe the variable your test was designed to isolate, the test stops producing signal — and a test that produces no signal while everyone still trusts it is worse than no test at all.

Context Length Is a Security Boundary, Not Just a Cost Line

· 9 min read
Tian Pan
Software Engineer

Most teams treat the context window as a budget. You have a million tokens; spend them wisely; longer conversations cost more and run slower. That framing is correct and incomplete. The context window is also an attack surface, and its size is a dial that quietly weakens your safety controls as it turns up.

Here is the failure mode nobody puts in the threat model. Your system prompt — the one with the guardrails, the tool-use rules, the "never do X" clauses — sits at the very top of the context. Its authority is strongest there. As a conversation runs, thousands of tokens of user turns, tool outputs, and retrieved documents pile on top of it. The model's attention does not weigh all of those tokens equally. The instructions closest to the point of generation win ties. By turn forty, your guardrails are not gone, but they are buried, and a patient adversary does not need a clever jailbreak to get past them. They just need a conversation long enough.

This is not a hypothetical. It is a measurable property of how transformers attend to long contexts, and it has a name in the research literature even if it does not have one in your incident review template.

Dogfooding Is Not an Eval Strategy

· 9 min read
Tian Pan
Software Engineer

Every team building an AI product reaches the same comfortable conclusion: "We use it every day, and it works great." That sentence feels like evidence. It is not. It is the single most misleading signal in the room, and it gets stronger — more convincing, more wrong — the better your team is.

Dogfooding tells you the product runs. It does not tell you the product works. Those are different claims, and the gap between them is exactly where your launch goes sideways. The people who built the system are, statistically, the worst possible sample of the people who will use it. They share its mental model, they know its soft spots, and they have spent months training themselves to phrase requests the way the model likes. That is not a test population. That is a control group for a study you never ran.

Your Eval Set Is a Frozen Photograph of Traffic Your Users Already Left

· 10 min read
Tian Pan
Software Engineer

You shipped a model upgrade. The eval suite went from 87% to 91%. The release notes wrote themselves, leadership clapped, and then the dashboards that actually matter — user satisfaction, escalation rate, thumbs-down ratio — did nothing. Flat. Maybe slightly worse.

This is one of the most disorienting failure modes in AI engineering, because nothing is broken. The eval ran correctly. The numbers are real. The model genuinely improved on the 600 examples you tested it against. The problem is that those 600 examples are a photograph of traffic from the week you built the suite, and your users have spent the months since then walking out of frame.

The Fallback Model You Never Load-Tested

· 8 min read
Tian Pan
Software Engineer

Every resilient LLM design has a line in the config that names a secondary model. It is there because someone, during a design review, asked the right question — "what happens when the primary is down?" — and someone else answered it with a fallback: key. Everyone nodded. The architecture diagram got a second box with a dotted arrow. The compliance doc got a sentence about graceful degradation.

And then nobody touched it again.

The fallback model is the most confidently asserted, least exercised component in most production AI systems. It is named, documented, and diagrammed — and on the day it actually carries traffic, it is also the day it has its first encounter with a real request. You did not build a safety net. You built a second model with an unknown breaking strain, and you will discover that strain at the worst possible moment.

Your Fallback Path Is the Only Untested Code in Production

· 9 min read
Tian Pan
Software Engineer

Every serious AI system ships with a fallback. When the primary model is rate-limited, route to a cheaper one. When the provider returns 5xx, serve a cached answer. When confidence drops below a threshold, fall back to a hand-written heuristic. The architecture diagram has a clean little branch labeled "degraded mode," and everyone feels safer for it.

Here is the uncomfortable part. That branch is the only code in your system that almost never runs. The primary path executes millions of times a day and gets debugged, profiled, and battle-tested by sheer traffic volume. The fallback executes approximately never — until the day it executes for everyone at once, under load, during an incident, while three engineers watch a dashboard turn red.

A fallback you do not exercise is not redundancy. It is a second, unmonitored system whose debut is statistically guaranteed to happen at the worst possible moment.

The Happy Path Is the Only Path Your Agent Eval Ever Tested

· 10 min read
Tian Pan
Software Engineer

Look at where most agent eval sets come from. Someone builds the agent, demos it to the team, the demo works, and the demo script becomes the eval suite. The cases that pass review are the cases someone already watched pass. The eval set is, almost by construction, a recording of the happy path — the one tool sequence that worked the day the screenshot was taken.

So when the dashboard says the agent scores 94%, what it actually says is: it passes the cases we imagined. It says nothing about the case where the search API returns a 429 in the middle of a multi-step plan, where the user contradicts a constraint they stated two turns ago, or where retrieval comes back empty and the agent has to decide between guessing and admitting it doesn't know. Those cases aren't failing your eval. They were never in it.

This is golden-path bias, and it is the default shape of an agent eval suite unless you fight it deliberately. The fix is not more cases. It is different cases — chosen by failure mode, harvested from production, and stress-tested with deliberate faults.

Your Internal API Became a Public API the Day an Agent Called It

· 10 min read
Tian Pan
Software Engineer

Internal APIs survive on a quiet arrangement: nobody writes the contract down because everybody already knows it. The fields that happen to be there, the error you throw that a caller secretly parses, the endpoint that returns 200 with an empty list instead of 404 — these are load-bearing behaviors held together by the fact that you can name every caller and Slack them before you change anything. That arrangement works right up until it doesn't.

It stops working the day you wire an agent to that API. Not because the agent is malicious or careless, but because the agent is a caller you cannot reach. It has no Slack handle. It did not read your migration note. It depends on response shapes it absorbed from an example payload or a schema snapshot, and it will keep depending on them long after you've moved on.

The uncomfortable truth is that "internal" was never a property of the API. It was a property of the caller list. Shorten that list to people you know and the API is internal; add one participant you can't coordinate with and the API is public — with all the discipline that word implies, and none of the infrastructure you'd have built if you'd known.