
578 posts tagged with "insider"


Why Your AI Roadmap Shouldn't Have a 12-Month Plan

Tian Pan, Software Engineer · 9 min read

A team I worked with last quarter spent six weeks building a "smart document classifier" — fine-tuned model, eval harness, custom UI, the whole production pipeline. It shipped on a Tuesday. The following Monday, a new general-purpose model dropped that beat their fine-tune on the same eval, zero-shot, with no infrastructure investment. Their entire Q2 OKR became a wrapper around a one-line API call. The roadmap had committed twelve months earlier to "owning the classification stack." That commitment was wrong before the ink dried.

This is not an isolated story. Industry trackers logged 255 model releases from major labs in Q1 2026 alone, with roughly three meaningful frontier launches per week through March. Costs have collapsed: API pricing is down 97% since GPT-3, and the gap between top providers has narrowed to within statistical noise on most benchmarks. When the underlying substrate changes this fast, a twelve-month feature roadmap is not a plan — it is a list of bets you cannot revisit, made with information that will be stale before you ship the second item.

The Air-Gapped LLM Blueprint: What Egress-Free Deployments Actually Need

Tian Pan, Software Engineer · 11 min read

The cloud AI playbook assumes one primitive that nobody writes down: outbound HTTPS. Vendor APIs, hosted judges, telemetry pipelines, model registries, vector stores, dashboard SaaS, secret managers — every one of them quietly resolves to a domain on the public internet. Pull that one cable and the stack does not degrade gracefully. It collapses.

That is the moment most teams discover their architecture has an egress dependency they never accounted for. A "small" prompt update needs to call out to a hosted classifier. The eval suite hits an LLM judge over the wire. The observability agent phones home. The model registry pulls weights from a CDN. None of it is malicious, and none of it is unusual. It is just what the cloud-native stack looks like when you stop noticing the cable.
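One way to surface those hidden dependencies before an air-gapped deployment is to run the whole pipeline in CI with outbound sockets disabled, so every stray call to a hosted judge, registry, or telemetry endpoint fails loudly. A minimal sketch with pytest, where `my_stack.run_pipeline` is a hypothetical stand-in for your own entry point:

```python
# Minimal sketch: fail any test that tries to leave loopback.
import socket

import pytest

_real_connect = socket.socket.connect

def _no_egress_connect(self, address):
    host = address[0] if isinstance(address, tuple) else address
    if host not in ("127.0.0.1", "::1", "localhost"):
        raise RuntimeError(f"blocked egress attempt to {host!r}")
    return _real_connect(self, address)

@pytest.fixture(autouse=True)
def block_egress(monkeypatch):
    # Every connect outside loopback becomes a loud failure, which is exactly
    # how quiet dependencies on judges, registries, and telemetry show up.
    monkeypatch.setattr(socket.socket, "connect", _no_egress_connect)

def test_prompt_update_runs_offline():
    from my_stack import run_pipeline  # hypothetical: your own entry point
    assert run_pipeline("classify this document") is not None
```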

Your Embedding Model Choice Sets the Ceiling Your LLM Can't Raise

Tian Pan, Software Engineer · 11 min read

A team I was advising had spent two months swapping LLMs in their RAG pipeline. Claude, GPT, Gemini, then back again. Each swap shaved a few percentage points off hallucination rate but never moved the needle on the metric that mattered: their support agents still couldn't find the right knowledge base article more than 60% of the time. They were tuning the wrong layer. The retriever was returning irrelevant chunks, and no amount of LLM cleverness can answer a question from documents the retriever never surfaced.

The embedding model is the part of a RAG system that decides what the LLM is even allowed to see. It draws the geometry of your corpus — which documents land near which queries in vector space. Once that geometry is wrong, the LLM is just a confident narrator of bad context. Swapping it for a smarter one usually makes the answers more articulate, not more correct.
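A quick way to check whether you are tuning the wrong layer is to score the retriever on its own, before any LLM touches the output. A minimal recall@k sketch, assuming you have a small labeled set of (query, relevant document id) pairs and a `retrieve` function wrapping your own vector store:

```python
# Minimal sketch: measure retriever recall@k with no LLM in the loop.
from typing import Callable

def recall_at_k(
    labeled_pairs: list[tuple[str, str]],       # (query, relevant doc id)
    retrieve: Callable[[str, int], list[str]],  # your store: (query, k) -> ranked doc ids
    k: int = 5,
) -> float:
    hits = sum(
        1 for query, relevant_id in labeled_pairs
        if relevant_id in retrieve(query, k)
    )
    return hits / len(labeled_pairs)

# If recall@5 sits at 0.6, no generator swap will answer more than roughly
# 60% of questions correctly: the ceiling is set here, not in the LLM.
```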

Eval as a Pull Request Comment, Not a Job: Embedding LLM Quality Gates in Code Review

Tian Pan, Software Engineer · 11 min read

Most teams that say "we have evals" mean: there is a dashboard, somebody runs the suite weekly, and the numbers get pasted into a Slack channel that nobody reads. Reviewers approve a prompt change without ever seeing whether it moved the suite, and the regression shows up two weeks later in a customer ticket. The eval exists; the eval is not in the loop.

The fix is structural, not motivational. Evals only gate quality when they live where the change lives — in the pull request comment, next to the diff, with a per-PR delta and a regression callout that the reviewer cannot scroll past. Anywhere else, they are a performative artifact: real work was done to build them, and they catch nothing.
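Concretely, that means the CI job computes the delta against a stored baseline and writes it onto the PR itself. A minimal sketch using the standard GitHub "create issue comment" endpoint; `run_suite` and the baseline file are assumptions about your own harness:

```python
# Minimal sketch: run the eval suite in CI, post the per-PR delta, block on regression.
import json
import os

import requests

from my_evals import run_suite  # hypothetical: returns {"metric_name": score, ...}

def main() -> None:
    current = run_suite()
    baseline = json.load(open("eval_baseline.json"))

    lines, regressed = ["### Eval delta for this PR", ""], False
    for metric, new_score in sorted(current.items()):
        old_score = baseline.get(metric, 0.0)
        delta = new_score - old_score
        flag = "  <-- regression" if delta < -0.01 else ""
        regressed = regressed or bool(flag)
        lines.append(f"- {metric}: {old_score:.3f} -> {new_score:.3f} ({delta:+.3f}){flag}")

    repo, pr_number = os.environ["GITHUB_REPOSITORY"], os.environ["PR_NUMBER"]
    requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
        json={"body": "\n".join(lines)},
        timeout=30,
    ).raise_for_status()

    if regressed:
        raise SystemExit(1)  # the gate blocks the merge instead of decorating it

if __name__ == "__main__":
    main()
```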

Eval Set Rot: Why Your Score Trends Up While Users Trend Down

Tian Pan, Software Engineer · 10 min read

The eval score has been trending up for two quarters. The dashboard is green, the regression suite has not flagged a real failure since March, and the team has gotten faster at shipping prompt changes because the eval gives crisp pass/fail answers. Meanwhile, user-reported quality is sliding. NPS is down four points, the support queue is full of failure modes nobody has labels for, and the head of product has started asking why the evals look great if customers are angry.

The eval set is not lying. It is answering the question it was built to answer, six months ago, against the traffic distribution that existed in launch week. The product has shifted. The user base has shifted. The long-tail use cases the team did not anticipate at launch now make up a third of traffic. The eval set is still measuring the world that existed in week one, and the team is averaging today's model against yesterday's product.

This is eval set rot. It is one of the quietest failure modes in modern AI engineering, and it gets worse as the eval set gets bigger, because the people maintaining it confuse "more cases" with "better coverage."
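One cheap way to make the rot visible is to compare the category mix of the eval set against the category mix of recent production traffic. A minimal sketch, assuming both eval cases and logged requests already carry some category or intent label:

```python
# Minimal sketch: find traffic slices the eval set cannot see at all.
from collections import Counter

def coverage_gaps(
    eval_categories: list[str],     # one label per eval case
    traffic_categories: list[str],  # one label per recent production request
    min_share: float = 0.02,        # ignore slices under 2% of traffic
) -> list[str]:
    eval_mix = Counter(eval_categories)
    traffic_mix = Counter(traffic_categories)
    total = sum(traffic_mix.values())

    gaps = []
    for category, count in traffic_mix.most_common():
        share = count / total
        if share >= min_share and eval_mix.get(category, 0) == 0:
            gaps.append(f"{category}: {share:.1%} of traffic, 0 eval cases")
    return gaps

# Run this weekly. Every line it prints is a slice of today's product that an
# eval set frozen at launch simply cannot score, no matter how large it grows.
```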

Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

Tian Pan, Software Engineer · 10 min read

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six or eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.
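The gap is easy to demonstrate with a small simulation. The sketch below assumes each step's latency is log-normal (a common shape for service calls); the median and sigma are illustrative, not measurements:

```python
# Minimal sketch: chaining steps that are "fast at the median" inflates the tail.
import math
import random

random.seed(0)

def step_latency(median_s: float = 0.3, sigma: float = 1.0) -> float:
    # Log-normal with the given median; sigma controls how heavy the tail is.
    return random.lognormvariate(math.log(median_s), sigma)

def percentile(values: list[float], p: float) -> float:
    ordered = sorted(values)
    return ordered[int(p / 100 * (len(ordered) - 1))]

single = [step_latency() for _ in range(100_000)]
chained = [sum(step_latency() for _ in range(4)) for _ in range(100_000)]  # 4-step agent

print(f"single step   P50={percentile(single, 50):.2f}s  P99={percentile(single, 99):.2f}s")
print(f"4-step agent  P50={percentile(chained, 50):.2f}s  P99={percentile(chained, 99):.2f}s")
# The chained P50 stays unremarkable; the chained P99 lands far beyond four
# times the single-step median, which is the gap dashboards hide and users feel.
```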

Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring

Tian Pan, Software Engineer · 10 min read

The first time a finance team asks an AI product team to forecast unit economics, the conversation goes the same way. The team pulls up the inference dashboard, points at the monthly token spend, and says "that's our COGS." The CFO multiplies by projected volume, draws a line on a chart, and asks where the gross margin curve crosses 70%. Six weeks later, when the actual P&L lands, the inference number on the dashboard is correct and the gross margin is twenty points lower than the forecast. Nobody is lying. Inference was just half of what the agent actually costs.

The other half is distributed across line items that nobody on the AI team owns. The vector database bill grows quietly because retrieval volume tracks usage and re-indexing costs are billed against compute, not storage. The observability platform's invoice arrives from the platform team's budget. Embedding regeneration shows up as a CI cost. Telemetry storage is filed under data warehouse. Human review is in customer-success headcount. None of these line items is alarming on its own — and that is exactly why the integrated number is the one that surprises everyone.
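The fix starts with accounting, not engineering: fold every line item over the same denominator. A minimal sketch in which every rate is a hypothetical placeholder; the point is that the per-request number only means something when all the costs share it:

```python
# Minimal sketch: all-in cost per request, not just the inference dashboard number.
from dataclasses import dataclass, fields

@dataclass
class MonthlyAgentCosts:
    inference_usd: float          # the number already on the dashboard
    vector_db_usd: float          # storage plus re-indexing compute
    observability_usd: float      # tracing / logging platform share
    embedding_ci_usd: float       # re-embedding jobs billed to CI
    telemetry_storage_usd: float  # warehouse share for traces
    human_review_usd: float       # loaded cost of review hours

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))

costs = MonthlyAgentCosts(42_000, 9_500, 7_000, 3_200, 4_800, 15_000)  # illustrative
requests_served = 1_200_000

print(f"inference-only COGS per request: ${costs.inference_usd / requests_served:.4f}")
print(f"all-in COGS per request:         ${costs.total() / requests_served:.4f}")
# The second number is the one the gross-margin forecast actually needed.
```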

The Privacy Boundary No One Tests: Why 'Stateless' Tools Are the AI-Era IDOR

Tian Pan, Software Engineer · 10 min read

A tool labeled "stateless" is a promise the runtime cannot keep. Behind the function signature sits a Redis cache, a vector index, an embedding store, a rate-limit table, a memoization layer, an LRU on the hot path — any one of which is a shared substrate where one user's data can land on another user's response. The function is stateless. The system is not. And in 2026, this is the most common privacy bug I see in agentic systems, because almost no one tests for it.

The shape of the bug is depressingly familiar to anyone who has worked on classic web apps. Insecure Direct Object Reference — IDOR — was the bread and butter of bug bounty for a decade: a request handler that accepts a record ID and returns the record without checking whether the caller is allowed to see it. The AI-era version is the same bug with a worse blast radius: a tool call that accepts a query and returns data without checking whether the caller's tenant owns that data. The query is in natural language. The cache key is a hash. The retrieval is approximate. None of those things absolve you of authorization, but each of them makes the bug harder to spot in code review.
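The minimal defense is mechanical: the tenant is part of every cache key and every retrieval filter that touches shared state. A sketch under that assumption; the `cache` and `search` arguments stand in for whatever Redis client and vector store call your stack actually uses:

```python
# Minimal sketch: tenant scoping on the "stateless" tool's shared substrate.
import hashlib
from typing import Any, Callable

def _cache_key(tenant_id: str, query: str) -> str:
    # The tenant id is baked into the key, so a cache hit can never cross tenants.
    digest = hashlib.sha256(query.encode()).hexdigest()
    return f"kb:{tenant_id}:{digest}"

def lookup_article(
    tenant_id: str,
    query: str,
    cache: Any,                          # e.g. a redis.Redis client (assumed)
    search: Callable[..., list[dict]],   # your vector store's query call (assumed)
) -> str:
    key = _cache_key(tenant_id, query)
    cached = cache.get(key)
    if cached is not None:
        return cached.decode() if isinstance(cached, bytes) else cached

    # The retrieval is also filtered by tenant, not just the cache: approximate
    # search over a shared index is where cross-tenant reads hide.
    results = search(query, filter={"tenant_id": tenant_id}, top_k=3)
    answer = results[0]["text"] if results else ""
    cache.set(key, answer, ex=3600)
    return answer
```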

Why Your Prompt Library Should Be a Monorepo, Not a Cookbook

Tian Pan, Software Engineer · 11 min read

A team I worked with recently had three different "summarize this contract" prompts. One lived in a Notion page that the legal-tech squad copy-pasted into their service. One lived in a prompts/ folder in the customer-success backend, slightly modified to handle their tone preferences. One lived inline in a Python file inside the data team's notebook, hardcoded between two f-string interpolations. When OpenAI deprecated the model they all ran on, the migration plan involved Slack archaeology — each owner had to be tracked down, each variant had to be re-evaluated, and two of the three subtly broke in production for a week before anyone noticed.

This is what a prompt cookbook looks like at scale. Cookbooks make sense for ten prompts and one team. They become unmanageable somewhere around a hundred prompts and four teams. By the time you're running an AI organization, your prompts/ folder of .md files behaves exactly like vendored copy-paste code from 2008: every consumer has its own snapshot, drift is invisible, and breaking changes ripple outward in unpredictable ways.
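The monorepo alternative is not exotic: one importable package, one file per prompt version, one registry function every consumer goes through. The layout and names below are illustrative, not a prescription:

```python
# Minimal sketch: prompts as a single versioned package instead of per-team copies.
#
#   prompts/
#     registry.py            <- this file
#     contract_summary/
#       v3.md
#       meta.yaml            <- owner, model constraints, eval suite to run
#
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Prompt:
    name: str
    version: str
    template: str

def load(name: str, version: str) -> Prompt:
    root = Path(__file__).parent / name
    template = (root / f"{version}.md").read_text()
    return Prompt(name=name, version=version, template=template)

# Every consumer pins an explicit version in reviewed code:
#   summarize = load("contract_summary", "v3")
# A model deprecation then becomes one grep and one PR, not Slack archaeology.
```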

Tool Call Ordering Is a Partial Order, Not a Set

Tian Pan, Software Engineer · 10 min read

A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.

This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, others can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.

The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.
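Declaring the order can be as small as a set of "must happen before" edges that the orchestrator checks against every plan before executing it. A minimal sketch with illustrative tool names, assuming each tool appears at most once per plan:

```python
# Minimal sketch: validate a planned tool sequence against declared ordering edges.

# "a must run before b"; anything not listed here is genuinely order-free.
MUST_PRECEDE = {
    ("create_entity", "notify_webhook"),
    ("create_entity", "attach_document"),
    ("reserve_inventory", "charge_card"),
}

def ordering_violations(planned_calls: list[str]) -> list[tuple[str, str]]:
    position = {name: i for i, name in enumerate(planned_calls)}
    return [
        (before, after)
        for before, after in MUST_PRECEDE
        if before in position and after in position and position[after] < position[before]
    ]

plan = ["notify_webhook", "create_entity", "attach_document"]
print(ordering_violations(plan))
# [('create_entity', 'notify_webhook')] -> reject the plan and ask for a replan,
# instead of letting the consumer 404 on an entity that does not exist yet.
```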

The Semver Lie: Why a Minor LLM Update Breaks Production More Reliably Than a Major Refactor

Tian Pan, Software Engineer · 11 min read

There is a quiet myth in AI engineering that goes like this: a "minor" model bump — claude-x.6 to claude-x.7, gpt-y.0 to gpt-y.1, the patch-level snapshot rolling forward by a date — should be a drop-in upgrade. The provider's release notes talk about improved reasoning, lower latency, better tool use. The version number ticks gently. Nothing about the change reads as breaking.

Then it ships. And the on-call channel lights up with reports that the summarizer is now adding a paragraph that wasn't there before, that the JSON extractor is escaping unicode it used to leave alone, that the agent loop is now hitting the max-step ceiling on tasks that used to terminate in three calls. The eval scores look fine in aggregate; the user-visible feature is subtly wrong.
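The countermeasure is to treat every model id as a pinned dependency and to gate each bump on per-feature slices rather than the aggregate score, which is exactly where these regressions hide. A sketch under that assumption; `run_slices` stands in for your own harness:

```python
# Minimal sketch: gate a "minor" model bump on per-feature eval slices.
from typing import Callable

def check_bump(
    component: str,                                      # e.g. "summarizer"
    pinned_model: str,                                   # exact snapshot currently in prod
    candidate_model: str,                                # the proposed bump
    run_slices: Callable[[str, str], dict[str, float]],  # (component, model) -> per-slice scores (assumed)
    threshold: float = 0.01,
) -> bool:
    baseline = run_slices(component, pinned_model)
    proposed = run_slices(component, candidate_model)
    regressions = {
        slice_name: round(proposed.get(slice_name, 0.0) - score, 4)
        for slice_name, score in baseline.items()
        if proposed.get(slice_name, 0.0) < score - threshold
    }
    if regressions:
        print(f"refusing {component} bump to {candidate_model}: {regressions}")
        return False
    return True

# An aggregate average can stay flat while one slice (JSON escaping, loop
# termination, summary length) quietly falls off a cliff; the gate has to
# look at the slices, because that is where the user-visible feature lives.
```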

Agent Disaster Recovery: When Working Memory Dies With the Region

Tian Pan, Software Engineer · 12 min read

The DR runbook your team rehearses every quarter was written for a stack you no longer fully run. It says: promote the replica, repoint DNS, drain the queue. It assumes state lives in databases, queues, and object storage — places the SRE org has owned, named, and tested for a decade. Then last quarter you shipped an agent. Working memory now lives in the inference provider's session cache, scratchpad files on a worker's local disk, in-flight tool results that haven't been written back, and a partial plan-and-act trace that exists only in the prompt history of one model call. None of that is on the asset register. None of it is in the runbook.

When the region drops, the agent doesn't fail cleanly. It half-completes. The user sees a workflow that started but cannot be resumed in the failover region, the customer's invoice gets sent twice or not at all because the idempotency key lived on the dead worker, and the on-call engineer reads a Slack thread that begins "the orchestrator is up, but..." and ends six hours later with a credit-card chargeback queue.

This is the gap nobody named: agentic features have a state model the existing DR plan doesn't describe. The team that hasn't written that state surface down is one regional outage away from learning what their runbook's silence costs.
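Writing the state surface down starts with externalizing it: checkpoint the working memory after every tool call to storage that survives the region, keyed so the failover side can either resume or cancel cleanly. A minimal sketch where `store` stands in for any replicated KV or object store your DR plan already trusts:

```python
# Minimal sketch: externalize agent working memory so a failover region can
# resume a half-finished run or abandon it without double-firing side effects.
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any

@dataclass
class AgentCheckpoint:
    run_id: str
    step: int
    plan: list[str]                   # remaining planned tool calls
    tool_results: dict[str, dict]     # results already written back
    scratchpad: str                   # the "prompt-only" state, made durable
    idempotency_keys: dict[str, str]  # tool name -> key used for its side effect
    updated_at: float = field(default_factory=time.time)

def save_checkpoint(store: Any, state: AgentCheckpoint) -> None:
    # Written after every tool call, before the next one is issued.
    store.put(f"agent-runs/{state.run_id}/step-{state.step:04d}.json",
              json.dumps(asdict(state)))

def resume(store: Any, run_id: str) -> AgentCheckpoint | None:
    latest = store.get_latest(f"agent-runs/{run_id}/")  # assumed helper on your store
    return AgentCheckpoint(**json.loads(latest)) if latest else None

# The failover region replays nothing that already has an idempotency key
# recorded, finishes what it can, and cancels the rest cleanly instead of
# leaving the user with a workflow that started and silently died.
```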