
578 posts tagged with "insider"


Tokens Are a Finite Resource: A Budget Allocation Framework for Complex Agents

· 10 min read
Tian Pan
Software Engineer

Frontier models now advertise context windows of 200K, 1M, even 2M tokens. Engineering teams treat this as a solved problem and move on: the number is large, surely we'll never hit it.

Then, six hours into an autonomous research task, the agent starts hallucinating file paths it edited three hours ago. A coding agent confidently opens a function it deleted in turn four. A document analysis pipeline begins contradicting conclusions it drew from the same document earlier in the session. These are not model failures. They are context budget failures — predictable, measurable, and almost entirely preventable if you treat the context window as the scarce compute resource it actually is.
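To make the resource framing concrete, here is a minimal sketch of what a per-section budget might look like. The section names, ratios, and output reserve are illustrative assumptions, not the framework from the full post.

```python
from dataclasses import dataclass

@dataclass
class ContextBudget:
    window_tokens: int                 # the model's advertised context window
    reserved_for_output: int = 4_000   # headroom for the response itself

    def allocate(self) -> dict[str, int]:
        usable = self.window_tokens - self.reserved_for_output
        # Fixed ratios per section; a real allocator would tune these per task.
        ratios = {
            "system_prompt": 0.05,
            "tool_definitions": 0.10,
            "retrieved_context": 0.35,
            "conversation_history": 0.40,
            "scratchpad": 0.10,
        }
        return {name: int(usable * r) for name, r in ratios.items()}

budget = ContextBudget(window_tokens=200_000)
print(budget.allocate())  # history gets ~78K tokens, not "whatever happens to fit"
```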

Agent Fleet Observability: Monitoring 1,000 Concurrent Agent Runs Without Dashboard Blindness

· 12 min read
Tian Pan
Software Engineer

Running a hundred agents in production feels manageable. You have traces, you have dashboards, you know when something breaks. Running a thousand concurrent agent runs is a different problem entirely — not because the agents are more complex, but because the monitoring model you built for ten agents silently stops working long before you notice.

The failure mode is subtle. Everything looks fine. Your span trees are there. Your error rates are low. And then a prompt regression that degraded output quality for 40% of sessions for six hours shows up only because a customer complained — not because your observability stack caught it.

This is the dashboard blindness problem: per-agent tracing works beautifully at small scale and fails quietly at fleet scale. Here is why it happens and what to do instead.
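As a taste of the alternative, here is a hedged sketch of fleet-level aggregation: score each session, bucket by prompt version, and watch tail percentiles instead of individual traces. The scoring source and field names are assumptions for illustration.

```python
import statistics
from collections import defaultdict

# Hypothetical input: (prompt_version, quality_score) pairs, one per session,
# produced by an automated eval. Field names are illustrative.
def fleet_quality_report(sessions: list[tuple[str, float]]) -> dict[str, dict]:
    by_version: dict[str, list[float]] = defaultdict(list)
    for prompt_version, score in sessions:
        by_version[prompt_version].append(score)

    report = {}
    for version, scores in by_version.items():
        scores.sort()
        report[version] = {
            "n": len(scores),
            "p50": statistics.median(scores),
            "p05": scores[int(0.05 * (len(scores) - 1))],  # tail quality
        }
    return report

# Alert on the tail, not the mean: a regression that hits 40% of sessions
# moves p05 long before it moves the average.
```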

Your Agent Traces Are Lying: Cardinality, Sampling, and Span Hierarchies for LLM Agents

· 11 min read
Tian Pan
Software Engineer

Your tracing dashboard says the agent made eight calls to serve a user request. In reality, it made forty-seven. Your head-based sampler quietly dropped most of them. The ones you kept are technically correct but causally useless — child spans orphaned from a root their parent sampler threw away.

This is not a visualization bug. It is the predictable outcome of pointing distributed tracing infrastructure designed for ten-span HTTP fan-outs at systems that generate hundreds of spans per user turn. Default OpenTelemetry configurations systematically undercount the work agents do, and the teams running those agents usually do not notice until a customer complains about latency the trace viewer says does not exist.
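One common mitigation, sketched below with the standard OpenTelemetry SDK, is to make the sampling decision once at the trace root and let every child span inherit it, so a kept trace keeps its whole span tree. This is an illustrative configuration, not necessarily the full fix the post describes.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

provider = TracerProvider(
    # Children follow the root's decision; the ratio applies only to new roots.
    sampler=ParentBased(root=TraceIdRatioBased(0.10))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent-runtime")

with tracer.start_as_current_span("user_turn"):       # sampled or not, decided once
    for step in range(3):
        with tracer.start_as_current_span(f"tool_call_{step}"):
            pass  # every child inherits the root's sampling decision
```

Tail-based sampling at the collector is the heavier-weight alternative when you need to keep whole traces based on what happened inside them.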

Agentic Task Complexity Estimation: Budget Tokens Before You Execute

· 10 min read
Tian Pan
Software Engineer

Two agents receive the same user message. One finishes in 3 seconds and 400 tokens. The other enters a Reflexion loop, burns through 40,000 tokens, hits the context limit mid-task, and produces a half-finished answer. Neither the agent nor the calling system predicted which outcome was coming. This is not an edge case — it is the default behavior when agents start tasks without any model of how deep the work will go.

LLM-based agents have no native sense of task scope before execution. A request that reads as simple in natural language might require a dozen tool calls and multiple planning cycles; a complex-sounding request might resolve in a single lookup. Without pre-execution complexity estimation, agents commit resources blindly: cumulative token spend grows quadratically as the full turn history is re-sent on every call, planning overhead dominates execution time, and by the time the system detects a problem, the early decisions that caused it are irreversible.
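A minimal version of pre-execution estimation can be as simple as a cheap classification pass that assigns a token and tool-call budget before the main loop starts. The tiers, keywords, and limits below are hypothetical placeholders; the post goes well beyond keyword matching.

```python
from dataclasses import dataclass

@dataclass
class TaskBudget:
    max_tokens: int
    max_tool_calls: int

# Hypothetical tiers; a real estimator might use a small, cheap model call
# or request features rather than keyword matching.
TIERS = {
    "lookup":   TaskBudget(max_tokens=2_000,  max_tool_calls=2),
    "standard": TaskBudget(max_tokens=15_000, max_tool_calls=8),
    "research": TaskBudget(max_tokens=60_000, max_tool_calls=25),
}

def estimate_budget(user_message: str) -> TaskBudget:
    msg = user_message.lower()
    if any(kw in msg for kw in ("compare", "investigate", "across", "summarize all")):
        return TIERS["research"]
    if any(kw in msg for kw in ("update", "refactor", "generate", "write")):
        return TIERS["standard"]
    return TIERS["lookup"]

budget = estimate_budget("What is the current status of order 1042?")
# The agent loop then checks spent tokens and tool calls against `budget`
# and escalates or stops deliberately instead of hitting the context limit mid-task.
```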

When Your AI Agent Consumes from Kafka: The Design Assumptions That Break

· 11 min read
Tian Pan
Software Engineer

The standard mental model for AI agents assumes HTTP: a client sends a request, the agent processes it, returns a response. Clean, synchronous, easy to reason about. When an LLM-powered function fails, you get an error code. When it succeeds, you move on.

Once you swap that HTTP interface for a Kafka topic or SQS queue, every one of those assumptions starts to crack. The queue guarantees at-least-once delivery. Your agent is stochastic. That combination produces failure modes that don't exist in deterministic systems—and the fixes aren't the same ones that work for traditional microservices.

This post covers what actually changes when AI agents consume from message queues: idempotency, ordering, backpressure, dead-letter handling, and the specific failure mode where a replayed message triggers different agent behavior the second time around.
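As a preview of the idempotency piece, here is a sketch of a consumer that derives a key from each message and skips work it has already done, so an at-least-once redelivery cannot trigger a second, different agent run. It assumes confluent-kafka; the topic, field names, and in-memory key store are illustrative.

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "agent-workers",
    "enable.auto.commit": False,          # commit only after the work is durable
})
consumer.subscribe(["agent-tasks"])

processed: set[str] = set()               # in production: Redis or a DB with TTL

def handle(task: dict) -> None:
    ...                                   # the stochastic LLM-backed work

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    task = json.loads(msg.value())
    key = task["task_id"]                 # idempotency key carried by the producer
    if key not in processed:              # replay => skip, don't re-run the agent
        handle(task)
        processed.add(key)
    consumer.commit(message=msg, asynchronous=False)
```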

AI Feature Decommissioning Forensics: What Dead Features Teach That Successful Ones Cannot

· 11 min read
Tian Pan
Software Engineer

Here's an uncomfortable pattern: the AI feature your team is about to launch next quarter already died at your company two years ago. It shipped under a different name, with a different prompt, solving a vaguely different problem, and it got quietly decommissioned after six months of flat adoption. Nobody wrote it up. Nobody connected the dots. The leading indicators that would have saved this cycle were sitting in dashboards that got archived along with the feature.

Most engineering orgs are elaborate machines for remembering successes. Launches get retrospectives, blog posts, internal celebrations. The features that got killed — the ones with 12% weekly active users despite a polished demo, the ones whose unit economics inverted when token costs compounded across a longer-than-expected tool chain, the ones users learned to trust, lost trust in, and then routed around — generate almost no institutional memory. And the failure patterns embedded in those deaths are exactly the ones your planning process has no way to price in.

The AI Incident Severity Taxonomy: When Is a Hallucination a Sev-0?

· 11 min read
Tian Pan
Software Engineer

A legal team's AI-powered research assistant fabricated three case citations and slipped them into a court filing. The citations looked plausible — real courts, real-sounding case names, coherent holdings. Nobody caught them before the brief was submitted. The incident cost the firm an emergency hearing, a public apology, and a bar inquiry.

Was that a sev-0? A sev-2? The answer depends on which framework you use — and traditional severity models will give you the wrong answer almost every time.

Software incident severity classification was built for deterministic systems. A service is either responding or it isn't. A database query either succeeds or throws an error. The failure modes are binary, the blame is traceable to a commit, and the fix is a rollback or a patch. AI systems break all three of those assumptions simultaneously, and organizations that apply traditional severity frameworks to LLM failures end up either panicking over noise or dismissing structural failures as one-off quirks.

AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

· 11 min read
Tian Pan
Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

The AI Reliability Floor: Why 80% Accurate Is Worse Than No AI at All

· 9 min read
Tian Pan
Software Engineer

Most teams measure AI feature quality by asking "how often is it right?" The more useful question is "how often does being wrong destroy trust faster than being right builds it?" These questions have different answers — and only the second one tells you whether to ship.

There is a reliability floor below which an AI feature does more damage than no feature at all: users learn to distrust the AI after enough errors, and that distrust generalizes. They stop trusting the feature when it is correct, they route around it, and eventually they stop using it entirely. At that point, you have not shipped a partially-useful product; you have shipped a conversion and retention hazard disguised as a feature.
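The floor falls out of simple arithmetic. If one visible error costs k times the trust a correct answer earns, the expected trust change per interaction is zero at an accuracy of k/(k+1). The sketch below works through that with an assumed asymmetry factor and deliberately ignores the generalization effect described above.

```python
# Illustrative arithmetic only; the asymmetry factor is an assumption, not data.
def break_even_accuracy(trust_lost_per_error: float, trust_gained_per_success: float) -> float:
    """Accuracy at which expected trust change per interaction is zero."""
    k = trust_lost_per_error / trust_gained_per_success
    return k / (k + 1)

# If a visible error costs 4x the trust a correct answer earns,
# an 80%-accurate feature only breaks even; anything below that erodes trust.
print(break_even_accuracy(4.0, 1.0))   # 0.8
```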

The AI Procurement Gap: Why Your Vendor Evaluation Process Can't Handle Probabilistic Systems

· 11 min read
Tian Pan
Software Engineer

A procurement team I worked with spent eleven weeks scoring four LLM vendors against a 312-row RFP spreadsheet. They negotiated 99.9% uptime, $0.0008 per 1K input tokens, SOC 2 Type II, and a glossy benchmark PDF that put their selected vendor 2.3 points ahead on MMLU. The contract was signed on a Friday. The following Tuesday, the vendor silently rolled a model update, and the customer-support agent the team had built started routing roughly 14% of refund requests to the wrong queue. The uptime SLA was honored. The benchmark scores were unchanged. The procurement process had functioned exactly as designed, and the system was still broken.

This is the AI procurement gap. The instruments enterprise procurement uses to manage software risk — feature checklists, uptime guarantees, security questionnaires, sample benchmarks — were built for systems whose outputs are reproducible. None of those instruments measure the thing that actually determines whether an AI vendor will keep working for you: the behavioral stability of a stochastic surface that the vendor controls and you do not.

Backpressure for LLM Pipelines: Queue Theory Applied to Token-Based Services

· 11 min read
Tian Pan
Software Engineer

A retry storm at 3 a.m. usually starts the same way: a brief provider hiccup pushes a few requests over the rate limit, your client library retries them, those retries land on a still-recovering endpoint, more requests fail, and within ninety seconds your queue depth has gone vertical while your provider dashboard shows you sitting at 100% of your tokens-per-minute quota with a backlog measured in five-figure dollars. The post-mortem will say "thundering herd." The honest answer is that you built a fixed-throughput retry policy on top of a variable-capacity downstream and forgot that queue theory has opinions about that.

Most of the well-known service resilience patterns were written for downstreams whose throughput is a wall: a database with a connection pool, a microservice with a known concurrency limit. LLM providers are not that. Your effective throughput is a moving target shaped by your tier, the model you picked, the size of the prompt, the size of the response, the time of day, and whether someone else on the same provider is fine-tuning a frontier model right now. Treating it like a fixed pipe is the root cause of most of the LLM outages I've seen this year.
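One way to stop treating the provider like a fixed pipe is to gate admission with a bucket denominated in tokens per minute and resized from whatever capacity the provider currently reports (for example, via rate-limit response headers). The sketch below is illustrative; the numbers and names are assumptions.

```python
import time

class TokenBucket:
    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.last_refill = time.monotonic()

    def update_capacity(self, reported_tpm: int) -> None:
        self.capacity = reported_tpm          # shrink or grow with the provider

    def try_acquire(self, estimated_tokens: int) -> bool:
        now = time.monotonic()
        self.available = min(
            self.capacity,
            self.available + self.capacity * (now - self.last_refill) / 60.0,
        )
        self.last_refill = now
        if self.available >= estimated_tokens:
            self.available -= estimated_tokens
            return True
        return False                          # caller queues or sheds load, not retries

bucket = TokenBucket(tokens_per_minute=90_000)
if not bucket.try_acquire(estimated_tokens=3_500):
    pass  # shed, defer, or route to a cheaper model; don't pile on retries
```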

The Bias Audit You Keep Skipping: Engineering Demographic Fairness into Your LLM Pipeline

· 10 min read
Tian Pan
Software Engineer

A team ships an LLM-powered feature. It clears the safety filter. It passes the accuracy eval. Users complain. Six months later, a researcher runs a 3-million-comparison study and finds the system selected white-associated names 85% of the time and Black-associated names 9% of the time — on identical inputs.

This is not a safety problem. It's a fairness problem, and the two require entirely different engineering responses. Safety filters guard against harm. Fairness checks measure whether your system produces equally good outputs for everyone. A model can satisfy every content policy you have and still diagnose Black patients at higher mortality risk than equally sick white patients, or generate thinner resumes for women than men. These disparities are invisible to the guardrail that blocked a slur.

Most teams never build the second check. This post is about why you should and exactly how to do it.
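For a sense of what the second check looks like, here is a sketch of the paired-comparison style of test the study used: identical inputs that differ only in a name, with selection rates tallied by group. The name lists, the selection stub, and the gating idea are assumptions for illustration.

```python
from collections import Counter

# Illustrative name lists; a real audit would use a validated, larger set.
WHITE_ASSOC = ["Emily Baker", "Greg Sullivan"]
BLACK_ASSOC = ["Lakisha Washington", "Jamal Robinson"]

def select_candidate(prompt: str) -> str:
    ...  # stub: call the LLM-backed selection step under test

def selection_rates(template: str, trials: int = 250) -> dict[str, float]:
    wins = Counter()
    for _ in range(trials):
        for a in WHITE_ASSOC:
            for b in BLACK_ASSOC:
                chosen = select_candidate(template.format(name_a=a, name_b=b))
                wins["white_assoc" if chosen == a else "black_assoc"] += 1
    total = sum(wins.values())
    return {group: count / total for group, count in wins.items()}

# A fairness gate could fail the build when the rates diverge past a set margin,
# the same way an accuracy eval gates a prompt change.
```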