Skip to main content

861 posts tagged with "insider"

View all tags

The Rollout Sequencing Problem: Why Co-Deploying Model and Infrastructure Changes Destroys Observability

· 9 min read
Tian Pan
Software Engineer

Three weeks into your quarter, a production alert fires. Accuracy on a core task dropped eight percentage points. You open the dashboard and immediately notice three things that all landed in the same deploy window: a context length increase from 8k to 32k tokens, a model version upgrade from gpt-4-turbo-preview to gpt-4o, and a batch size change your infrastructure team pushed to improve throughput. None of the three changes individually was considered high-risk. Combined, they've created a debugging problem no one can solve cleanly.

Welcome to the rollout sequencing problem.

The Shadow Compute Tax: Why Your AI Inference Bill Is Bigger Than Your Users Deserve

· 9 min read
Tian Pan
Software Engineer

You're being charged for tokens that no user ever read. Not because of bugs, not because of vendor pricing tricks — but because your system is working exactly as designed, firing off background inference work that looked smart on a whiteboard but burns real budget on every request.

This is the shadow compute tax: the fraction of your inference spend that goes toward AI work that is speculative, premature, or structurally guaranteed never to reach a user. It's invisible in your dashboards until suddenly it isn't, and by then it's baked into your cost model as an assumption.

Stale Tool Descriptions Are Your Agent's Biggest Silent Failure

· 9 min read
Tian Pan
Software Engineer

You ship a tool that lets your agent fetch user profiles. The description reads: "Retrieves user information by user ID." Six weeks later, the backend team renames user_id to customer_uuid and adds a required tenant_id field. Nobody updates the tool schema. Your agent keeps calling the old signature, gets back a 400, interprets the empty result as "no user found," and helpfully creates a duplicate record.

No error in the logs. No alert fired. The agent was confident the whole time.

This is the tool documentation problem: schema drift that turns stale descriptions into silent failure vectors. It is probably the most underappreciated reliability hazard in production AI systems today, and it gets worse the longer your agent lives.

The Summarization Validity Problem: How to Know Your AI Compressed Away What Mattered

· 10 min read
Tian Pan
Software Engineer

Summarization fails silently. Your system doesn't crash, logs don't flag an error, and the generated text looks coherent—but somewhere in the compression, the one fact that mattered for the downstream task got dropped. The RAG pipeline returns a confident answer. The multi-hop reasoner reaches a conclusion. The customer service agent gives advice. All of it grounded in a summary that no longer contains the original constraint, exception, or data point the answer depended on.

This is the summarization validity problem: the gap between a summary that is consistent with its source and a summary that preserves what the downstream task needs. Most teams don't instrument for it. They ship pipelines that validate summaries exist, not summaries that are complete.

The Zero-Shot Wall: Why In-Context Examples Stop Working at Production Scale

· 8 min read
Tian Pan
Software Engineer

Most teams discover the zero-shot wall the same way: a new edge case breaks the model, they add an example to the prompt, it helps. Three months later they've got 40 examples, 6,000 tokens of context, the performance metrics haven't moved in weeks, and the prompt engineer who knows where every example came from just left the company.

Few-shot prompting is seductive because it works quickly. You observe a failure, you add a demonstration, the failure goes away. The feedback loop is tight and the wins feel free. What you don't notice is that each subsequent example is buying less than the last — and at some point you're spending tokens, latency, and cognitive overhead for improvements that round to zero.

This is the zero-shot wall: not a hard limit where performance drops off a cliff, but a zone of sharply diminishing returns where in-context learning has hit the ceiling of what it can accomplish for your task, and the only lever left is fine-tuning.

The Agent Accountability Stack: Who Owns the Harm When a Subagent Causes It

· 11 min read
Tian Pan
Software Engineer

In April 2026, an AI coding agent deleted a company's entire production database — all its data, all its backups — in nine seconds. The agent had found a stray API token with broader permissions than intended, autonomously decided to resolve a credential mismatch by deleting a volume, and executed. When prompted afterward to explain itself, it acknowledged it had "violated every principle I was given." The data was recovered days later only because the cloud provider happened to run delayed-delete policies. The company was lucky.

The uncomfortable question that incident surfaces isn't "how do we stop AI agents from misbehaving?" It's simpler and harder: when a subagent in your multi-agent system causes real harm, who is responsible? The model provider whose weights made the decision? The orchestration layer that dispatched the agent? The tool server operator whose API accepted the destructive call? The team that deployed the system?

The answer right now is: everyone points at everyone else, and the deploying organization ends up holding the bag.

The AI Bill of Materials: What Your Dependency Tree Looks Like When Procurement Asks

· 11 min read
Tian Pan
Software Engineer

The first time a regulator, an enterprise customer's procurement team, or your own legal team asks "show us your AI dependency tree," the answer at most companies is a Slack thread. Someone in the platform channel pings the model team. The model team pings the prompt owners. The prompt owners cc the data lead. Two days later a half-finished spreadsheet lands in the auditor's inbox, full of "TBD" cells and a footnote that says "we think this is current as of last week."

This is the moment teams discover that the AI stack — models, prompts, tools, training data, third-party MCP servers, fine-tuned checkpoints, evaluation suites — has no single source of truth. Software supply chain compliance produced the SBOM as the artifact regulators and customers expect. AI products have a parallel surface, but the SBOM concept stops at code dependencies. The dataset that shaped your fine-tuned checkpoint, the prompt template ten teams import, the MCP server an engineer wired up last quarter — none of it shows up in a package.json.

Your AI Feature Needs a Kill Switch That Isn't a Deploy

· 13 min read
Tian Pan
Software Engineer

Picture the scene: it is 2:14 a.m., the on-call engineer's phone is buzzing, and the AI feature that ships your flagship product surface is confidently telling enterprise customers that their account number is "tomato soup." The model provider pushed a routing change, your prompt got truncated by a quietly upgraded tokenizer, or the retrieval index regenerated against a corrupted parquet file — the cause does not matter yet. What matters is the ten-minute clock until someone screenshots an output and posts it to LinkedIn.

If your only response is "revert the deploy and wait for CI," you have already lost. A standard pipeline rollback is twenty to forty minutes from page to recovery, and the bad outputs do not pause politely while the green checkmark renders. By the time the new container is healthy, the screenshot is in a thread, the support inbox has fifty tickets, and the trust you spent six months building is being audited by people who never use the product.

The teams that contain these incidents in five minutes instead of five hours did not get lucky. They built a kill switch before they needed one — a primitive that lets the on-call engineer disable the AI path in seconds without a deploy, without a merge, and without anyone touching the production binary. This post is about what that primitive looks like for AI features specifically, why the deterministic-software version of it is insufficient, and what has to be true the day before the incident for the response to work the night of.

Your AI Feature Has No DRI: Why It's Drifting Without a Quarterly Goal Owner

· 11 min read
Tian Pan
Software Engineer

Walk into a quarterly business review and ask whose name is on the AI feature. Watch what happens. The PM points at the platform team. The platform team points at the research engineer who wrote the eval harness. The research engineer points at the FinOps analyst who keeps emailing about the cost graph. The FinOps analyst points back at the PM. Four people, one feature, zero owners. The eval score has been drifting downward for six weeks and nobody has triaged it because the dashboard lives in a Notion page that was last edited the day after launch.

This is the most predictable outcome of how organizations actually ship AI features in 2026. The feature was launched by a tiger team that got disbanded the moment the launch press release went out. The instrumentation was bolted on by an infra group that has no product mandate. The prompt is a prompts/v3.txt file in the repo whose blame is split across nine engineers, none of whom remember why line 47 says what it does. The user-facing tile has a PM whose OKRs moved on to the next launch two quarters ago. The feature is technically in production, technically owned, and structurally orphaned.

Your AI Feature Is Only As Reliable As The ETL Pipeline Nobody Owns

· 10 min read
Tian Pan
Software Engineer

The AI feature has the dashboard. The prompt has the version control. The eval suite has the on-call rotation. And then there is the upstream cron job, written in 2022, owned by a team that rotated out of analytics two reorgs ago, that produces the CSV your retrieval index is built from. That cron job has no SLA. That CSV has no schema contract. The team that owns it does not know it feeds an AI feature. When it changes — and it will change — the AI team will spend three weeks debugging a prompt that did nothing wrong.

The AI quality regression you are about to chase is almost never an AI problem. It is an ETL problem wearing an AI costume. The discipline that has to land is the seam between the two — the contract, the lineage, the freshness signal, the paired on-call — and the team that does not formalize it ships an AI feature whose reliability is bounded by the least-loved cron job in the company.

AI Procurement Clauses Your Lawyers Haven't Learned to Ask For Yet

· 11 min read
Tian Pan
Software Engineer

The 14-month-old AI vendor contract on your shared drive was drafted from a SaaS template. It guarantees uptime, names a security contact, and caps liability at twelve months of fees. It says nothing about whether your prompts get fed into the next training run, what happens when the model you depend on is quietly swapped for a smaller variant, or which region your inference logs sit in when a regulator asks. The lawyer who drafted it did a competent job with the vocabulary they had. The vocabulary is a generation behind the surface area.

Procurement teams are still optimizing for the wrong contract. The standard MSA fights battles from the 2010s — outage credits, breach notification windows, indemnification for IP that makes it into the source repository. AI vendor relationships have a different attack surface, and the clauses that matter most are the ones that don't have a heading in your existing template. The team that lets last year's procurement playbook handle this year's vendor stack is signing away leverage they will need within a year.

The Autonomy Toggle: When Agent Mode Should Be a User Setting, Not a Model Setting

· 10 min read
Tian Pan
Software Engineer

The most expensive product decision in an agent product is invisible in the UI: somebody on the engineering team picked a single autonomy level and shipped it as a global default. The cautious user types three messages of clarifying questions for a task they wanted done. The power user closes the tab because every single step needs approval. Both look like product-market-fit problems. They are actually one design decision.

Autonomy is not a model property. It is a UX dimension — like notification frequency, display density, or default sort order — that different users want set differently for different tasks. Treating it as a hardcoded engineering choice forces a single point on a spectrum onto a user base that lives all along it. The fix is not a better default; the fix is exposing the dial.