578 posts tagged with "insider"

GPU Starvation: How One Tenant's Reasoning Prompt Stalls Your Shared Inference Endpoint

· 9 min read
Tian Pan
Software Engineer

Your dashboard says the GPU is healthy. Utilization hovers around 80%, throughput in tokens-per-second looks fine, cold starts are rare, and the model is the one you asked for. Yet your pager is going off because p99 latency has tripled, a handful of users are timing out, and support tickets all describe the same thing: "the app froze for twenty seconds, then came back." You pull a trace and find an unrelated customer's 28,000-token reasoning request sitting in the same batch as every stalled call. One tenant's deep-think prompt just ate everyone else's turn.

This is head-of-line blocking, and it is the failure mode that ruins shared LLM inference the moment reasoning models enter the traffic mix. The pattern is not new — storage systems and network stacks have fought it for decades — but it takes a specific shape on GPUs because of how continuous batching and KV-cache pinning work. Most teams design for average load and discover too late that "shared inference is cheaper" stops being true the instant request sizes stop being similar.
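
The arithmetic behind that stall fits in a few lines. Below is a deliberately toy model of a continuously batched decoder (the per-token cost and the admit-the-whole-prefill policy are assumptions, not any real engine's behavior), showing how one tenant's prefill taxes every in-flight decode:

```python
# Toy model of head-of-line blocking in a continuously batched decoder.
# Assumptions, not any real engine's behavior: each iteration costs time
# proportional to the tokens it processes, and a new request's prefill is
# admitted as one chunk inside the shared iteration.

DECODE_COST_PER_TOKEN_MS = 0.05  # hypothetical per-token cost


def iteration_ms(decode_slots: int, prefill_tokens: int = 0) -> float:
    """Wall-clock cost of one batch iteration: one new token for every
    decoding request, plus any prefill tokens admitted this step."""
    return (decode_slots + prefill_tokens) * DECODE_COST_PER_TOKEN_MS


normal = iteration_ms(decode_slots=32)                          # 32 short chats
stalled = iteration_ms(decode_slots=32, prefill_tokens=28_000)  # + one whale

print(f"normal iteration:  {normal:7.1f} ms")
print(f"stalled iteration: {stalled:7.1f} ms")
print(f"one tenant's prefill adds {stalled - normal:.0f} ms "
      f"to every other request's next token")
```

The numbers are invented; the shape is not. Whatever the engine, a 28,000-token prefill has to be paid for somewhere, and in a shared batch everyone pays.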

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

· 11 min read
Tian Pan
Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.
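
The dynamics in question are one subtraction and one division. A minimal sketch, with made-up rates, of what the green checkmark hides:

```python
# A toy model of the approval queue. All rates are hypothetical; the
# point is the shape of the curves, not the numbers.

arrival_per_hour = 40        # items flagged for human review
reviews_per_hour = 30        # what one attentive reviewer sustains
staleness_limit_hours = 2    # past this, approving no longer matters

backlog = 0.0
for hour in range(1, 9):     # one eight-hour shift
    backlog += arrival_per_hour - reviews_per_hour
    # A new arrival waits behind the entire backlog before it is seen.
    wait_hours = backlog / reviews_per_hour
    stale = wait_hours > staleness_limit_hours
    print(f"hour {hour}: backlog={backlog:4.0f}  "
          f"wait={wait_hours:3.1f}h  stale on arrival={stale}")
```

Whenever arrivals outpace reviews, backlog grows linearly and wait time grows with it; past the staleness limit the queue is approving history, not decisions.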

Inference Is Faster Than Your Database Now

· 10 min read
Tian Pan
Software Engineer

Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.

Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.

This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature store call, the auth middleware, and the Postgres lookup that was always that slow. Nobody cared, because the model was taking nine-tenths of the budget.
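
The reordering is easy to see with stand-in numbers (the span durations below are illustrative, echoing the figures above, not measurements from any one system):

```python
# Model share of the request budget, 2024 stack vs 2026 stack.
# Durations in milliseconds; the non-model spans are unchanged on purpose.

trace_2024 = {"auth": 25, "feature_store": 40, "postgres": 35, "model": 800}
trace_2026 = {"auth": 25, "feature_store": 40, "postgres": 35, "model": 180}

for year, spans in (("2024", trace_2024), ("2026", trace_2026)):
    total = sum(spans.values())
    infra = total - spans["model"]
    print(f"{year}: total={total} ms, model={spans['model'] / total:.0%}, "
          f"your own infrastructure={infra / total:.0%}")
```

The infrastructure crust did not get slower; its share of the budget roughly tripled, which is why it now sets the p99 floor.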

Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

· 8 min read
Tian Pan
Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.
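
A first step out of the opaque purple bar is splitting the call at the one boundary you can always observe from the client: time to first token versus everything after it. A sketch using the OpenTelemetry Python API; client.stream(...) is a hypothetical streaming client, and the attribute names are a local convention rather than any semantic standard:

```python
import time

from opentelemetry import trace

tracer = trace.get_tracer("llm.client")


def traced_completion(client, prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        start = time.monotonic()
        first_token_at = None
        chunks = []
        for chunk in client.stream(prompt):  # hypothetical streaming API
            if first_token_at is None:
                first_token_at = time.monotonic()
                # Queueing + prefill: everything before the first token.
                span.set_attribute("llm.ttft_ms",
                                   (first_token_at - start) * 1000)
            chunks.append(chunk)
        end = time.monotonic()
        if first_token_at is not None:
            # Decode: first token to last, the part that scales with
            # output length rather than prompt length or queue depth.
            span.set_attribute("llm.decode_ms",
                               (end - first_token_at) * 1000)
            span.set_attribute("llm.decode_tokens_per_s",
                               len(chunks) / max(end - first_token_at, 1e-9))
        return "".join(chunks)
```

Two attributes will not name the noisy neighbor or the routing change, but they will tell you which phase moved, which is the difference between one post-mortem and four.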

Markdown Beats JSON: The Output Format Tax You're Paying Without Measuring

· 11 min read
Tian Pan
Software Engineer

Most teams flip JSON mode on the day they ship and never measure what it costs them. The assumption is reasonable: structured output is a correctness win, so why wouldn't you take it? The answer is that strict JSON-mode constrained decoding routinely shaves 5–15% off reasoning accuracy on math, symbolic, and multi-step analysis tasks, and nobody notices because the evals were run before the format flag was flipped — or the evals measure parseability, not quality.

The output format is a decoding-time constraint, and like every constraint it warps the model's probability distribution. The warp is invisible when you look at logs: the JSON is valid, the schema matches, the field types line up. What you cannot see in the logs is the reasoning that the model would have produced in prose but could not fit inside the grammar you gave it. The format tax is real, well-documented in the literature, and almost universally unmeasured in production.

This post is about when to pay it, how to stop paying it when you don't have to, and what a format-choice decision tree actually looks like for engineers who want structured output and accuracy at the same time.
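
Measuring the tax on your own tasks takes one harness and one afternoon. A sketch, where generate(prompt, json_mode=...) stands in for whichever client you use and the exact-match grading is a placeholder for your task's real grader:

```python
# A/B harness for the format tax: same questions, JSON-constrained vs
# prose-then-parse. Everything here is a sketch under those assumptions.
import json
import re


def answer_json(generate, question: str) -> str:
    raw = generate(
        f'{question}\nReply as JSON: {{"answer": ...}}', json_mode=True
    )
    return str(json.loads(raw)["answer"])


def answer_prose(generate, question: str) -> str:
    raw = generate(
        f"{question}\nThink it through, then end with 'ANSWER: <value>'.",
        json_mode=False,
    )
    match = re.search(r"ANSWER:\s*(.+)", raw)
    return match.group(1).strip() if match else raw.strip()


def format_tax(generate, eval_set):
    """eval_set: list of (question, expected_answer) pairs."""
    json_hits = sum(answer_json(generate, q) == a for q, a in eval_set)
    prose_hits = sum(answer_prose(generate, q) == a for q, a in eval_set)
    n = len(eval_set)
    return {"json_acc": json_hits / n, "prose_acc": prose_hits / n}
```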

Mid-Flight Steering: Redirecting a Long-Running Agent Without Killing the Run

· 10 min read
Tian Pan
Software Engineer

Watch a developer use an agentic IDE for twenty minutes and you will see the same micro-drama play out three times. The agent starts a long task. Two tool calls in, the user realizes they want a functional component instead of a class, or a v2 endpoint instead of v1, or tests written in Vitest instead of Jest. They have exactly one lever: the red stop button. They press it. The agent dies mid-edit. They copy-paste the last prompt, append the correction, and pay for the first eight minutes of work twice.

The abort button is the wrong affordance. It treats "I want to adjust the plan" and "I want to throw away the run" as the same gesture. In practice they are as different as a steering wheel and an ejector seat, and conflating them is why so many agent products feel brittle the moment a task takes longer than a single screen of output.
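
The fix is an affordance, not a model change: give the run a channel the agent drains between steps. A minimal sketch, with the planner and executor as hypothetical stand-ins and the queue doing the only real work:

```python
# Steering channel: corrections amend the plan mid-run instead of
# killing it. Queue plumbing is standard library; plan_step and
# execute_step are hypothetical agent internals.
import queue

steering: "queue.Queue[str]" = queue.Queue()  # UI pushes corrections here


def run_agent(task: str, plan_step, execute_step):
    context = [f"TASK: {task}"]
    while True:
        # Drain any mid-flight corrections before planning the next step.
        while not steering.empty():
            correction = steering.get_nowait()
            context.append(f"USER CORRECTION (apply from here on): {correction}")
        step = plan_step(context)  # hypothetical: next action, or None when done
        if step is None:
            return context
        context.append(execute_step(step))  # hypothetical tool execution


# Elsewhere, the UI thread steers instead of ejecting:
# steering.put("use Vitest, not Jest, for the remaining tests")
```

The eight minutes of completed work stay paid for; only the steps after the correction see it.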

The Model-of-the-Week Roadmap: When Vendor Promises Become Committed Dependencies

· 9 min read
Tian Pan
Software Engineer

A product manager pulls up the next-quarter roadmap. Three features are marked "depends on next-gen model." Nobody asks what happens if next-gen slips, arrives 20% smaller than the demo suggested, or ships gated behind an enterprise tier your customers do not qualify for. Six months later, all three of those scenarios have happened, and the team is now rebuilding two quarters of architecture against the model that actually shipped — a different shape from the one they planned for.

This is the model-of-the-week roadmap: treating unreleased capability claims as committed dependencies. It is one of the most reliable ways to turn a twelve-month plan into a thirty-month plan, and it rarely looks risky in the moment because every vendor demo feels inevitable. The schedule damage is invisible until the slip compounds.

Multi-Model Reliability Is Not 2x: The Non-Linear Cost of a Second LLM Provider

· 13 min read
Tian Pan
Software Engineer

The naive calculation goes like this. Our primary provider has 99.3% uptime. Add a second, independent provider with similar uptime, and simultaneous failure drops to roughly 0.005%. Multiply cost by two, divide risk by roughly a hundred and forty. Engineering leadership signs off on the 2x budget and the oncall rotation stops paging on provider outages. The spreadsheet says this is the best reliability investment on the roadmap.
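
The availability half of that spreadsheet checks out, and it is worth seeing how little code it takes (uptime figures from above; independence is the load-bearing assumption):

```python
p_down = 1 - 0.993         # single provider down: 0.7%
p_both_down = p_down ** 2  # simultaneous failure, assuming independence

print(f"both down: {p_both_down:.4%}")                  # ~0.0049%
print(f"risk divided by: {p_down / p_both_down:.0f}x")  # ~143x
```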

Six months later the spreadsheet is wrong. The eval suite takes 3x as long to run, prompt changes need two PRs, the weekly regression report has two columns that disagree with each other, and nobody can remember which provider the staging fallback is currently routing to. The 2x budget is closer to 4–5x once the team tallies the human hours spent keeping both paths calibrated. The second provider is still technically serving traffic, but half the features have been quietly pinned to one side because keeping both in sync stopped being worth it.

This is the multi-model cost trap. The reliability math is correct; the operational math is the part teams get wrong. What follows is the cost decomposition of going multi-provider, the single-provider-with-degraded-mode option most teams should try first, and the narrow set of criteria that actually justify the nonlinear complexity.

The Orphan Adapter Problem: When Your Fine-Tune Outlives Its Base Model

· 12 min read
Tian Pan
Software Engineer

A senior engineer left six months ago. She owned the classifier adapter that routes customer support tickets — a rank-32 LoRA trained on 847 hand-labeled examples, pinned to a base model that hits end-of-life in 43 days. Nobody remembers why those 847 examples were chosen over the 2,000 they started with. The training data sits in an S3 bucket whose lifecycle policy purges objects older than one year. Her laptop was wiped. The fine-tuning notebook has a cell that calls a preprocessing function she imported from her personal dotfiles repo, now private.

This is the orphan adapter — a fine-tune that outlived its maintainers, outlived its data, and is about to outlive the base model it was trained on. It sits in your production stack, routing real user traffic, and nobody left on the team can rebuild it. The deprecation email didn't create this crisis. It just exposed it.
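
Preventing this does not require new tooling; it requires writing things down while the author is still around. A sketch of a provenance record checked in next to the adapter weights (the field names are invented here, and every value below is a hypothetical placeholder):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterProvenance:
    adapter_name: str
    base_model: str           # exact version, not an alias
    lora_rank: int
    dataset_uri: str          # pinned snapshot, exempt from lifecycle purges
    dataset_sha256: str
    selection_rationale: str  # why these examples and not the rest
    training_code_commit: str # shared repo + commit, no personal dotfiles
    eval_report_uri: str


ticket_router = AdapterProvenance(
    adapter_name="support-ticket-router",
    base_model="example-base-7b-2024-06",               # hypothetical
    lora_rank=32,
    dataset_uri="s3://ml-artifacts/ticket-router/v3/",  # hypothetical path
    dataset_sha256="<pinned digest>",
    selection_rationale="dedup + label-agreement filter: 2,000 -> 847",
    training_code_commit="github.com/example/ml-train@<commit>",
    eval_report_uri="s3://ml-artifacts/ticket-router/v3/eval.html",
)
```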

The Output Commitment Problem: Why Streaming Self-Correction Destroys User Trust More Than the Original Error

· 10 min read
Tian Pan
Software Engineer

A user asks your agent a question. Tokens start flowing. Three sentences in, the model writes "Actually, let me reconsider — " and pivots to a different answer. The revised answer is better. The user closes the tab.

This is the output commitment problem, and it is one of the most consistently underestimated UX failures in shipped AI products. The engineering mindset treats self-correction as a feature — the model noticed its own error, that is the system working as intended. The user-perception mindset treats it as a disaster — the product demonstrated, live, that its first confident claim was wrong. Those two readings are both correct, and they do not reconcile on their own.

The core asymmetry is that streaming makes thinking legible, and legible thinking is auditable thinking. A model that hallucinated silently and then produced a clean final answer would look competent. The same model, streaming every half-thought, looks like it is flailing. The answer quality is identical. The perception is not.

Pattern-Matching Failures: When Your LLM Solves the Wrong Problem Fluently

· 11 min read
Tian Pan
Software Engineer

A user pastes a long, complicated bug report into your AI assistant. It looks like a classic null-pointer question, with the same phrasing and code layout as thousands of Stack Overflow posts. The model responds confidently, cites the usual fix, and sounds authoritative. The user thanks it. The bug is still there. The report was actually about a race condition; the null-pointer framing was incidental to how the user described the symptom.

This is the single hardest bug class to catch in a production LLM system. The model did not refuse. It did not hedge. It did not hallucinate a fake API. It solved the wrong problem, fluently, and everyone downstream — the user, your eval pipeline, your guardrails — saw a plausible on-topic answer and moved on. I call these pattern-matching failures: the model latched onto surface features of the query and produced a confident answer to something adjacent to what was actually asked.

Plan-and-Execute Is Marketing, Not Contract: Plan Adherence as a First-Class SLI

· 9 min read
Tian Pan
Software Engineer

The agent printed a five-step plan. Step three said "fetch the user's billing history from the invoices service." The trace shows step three actually called the orders service, joined a stale customer table, and produced a number that looked right. The output passed the eval. The post-mortem found the regression six weeks later, when finance noticed the dashboard had quietly diverged from source-of-truth by 4%.

Nobody wrote a bug. The planner wrote a contract the executor never signed.

This is the failure mode plan-and-execute architectures bury under their own elegance. The pattern was sold as a way to give agents long-horizon coherence: a strong model drafts a plan, weaker models execute steps, the plan acts as a scaffold. In practice the plan is a marketing artifact — a plausible-looking story emitted at t=0, then promptly invalidated by every interesting thing that happens at t>0. The trace shows the plan. The trace shows the actions. Almost nobody is measuring the distance between them.
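
Measuring that distance is mechanical once plans and traces share step identifiers. A first-cut sketch of plan adherence as a ratio; the record shapes are hypothetical, adapt them to whatever your trace store emits:

```python
# Plan adherence as an SLI: the fraction of planned steps whose executed
# tool call matches what the plan declared.

def plan_adherence(planned_steps, executed_calls) -> float:
    """planned_steps: list of dicts like {"step": 3, "tool": "invoices"}.
    executed_calls: list of dicts like {"step": 3, "tool": "orders"}."""
    executed_by_step = {c["step"]: c["tool"] for c in executed_calls}
    matches = sum(
        executed_by_step.get(p["step"]) == p["tool"] for p in planned_steps
    )
    return matches / len(planned_steps) if planned_steps else 1.0


# The trace from the teaser: step three declared invoices, called orders.
plan = [{"step": i, "tool": t} for i, t in
        enumerate(["auth", "accounts", "invoices", "join", "report"], 1)]
trace = [{"step": i, "tool": t} for i, t in
         enumerate(["auth", "accounts", "orders", "join", "report"], 1)]

print(f"plan adherence: {plan_adherence(plan, trace):.0%}")  # 80%
```

Alert on the ratio the way you would on error rate, and the six-week gap between divergence and discovery collapses to a page.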