
578 posts tagged with "insider"


Your AI Feature Needs a Kill Switch That Isn't a Deploy

· 13 min read
Tian Pan
Software Engineer

Picture the scene: it is 2:14 a.m., the on-call engineer's phone is buzzing, and the AI feature that fronts your flagship product surface is confidently telling enterprise customers that their account number is "tomato soup." The model provider pushed a routing change, your prompt got truncated by a quietly upgraded tokenizer, or the retrieval index regenerated against a corrupted Parquet file — the cause does not matter yet. What matters is the ten-minute clock until someone screenshots an output and posts it to LinkedIn.

If your only response is "revert the deploy and wait for CI," you have already lost. A standard pipeline rollback is twenty to forty minutes from page to recovery, and the bad outputs do not pause politely while the green checkmark renders. By the time the new container is healthy, the screenshot is in a thread, the support inbox has fifty tickets, and the trust you spent six months building is being audited by people who never use the product.

The teams that contain these incidents in five minutes instead of five hours did not get lucky. They built a kill switch before they needed one — a primitive that lets the on-call engineer disable the AI path in seconds without a deploy, without a merge, and without anyone touching the production binary. This post is about what that primitive looks like for AI features specifically, why the deterministic-software version of it is insufficient, and what has to be true the day before the incident for the response to work the night of.
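The primitive itself is small. As a hedged sketch, assume the flag lives outside the binary in something an operator can flip in seconds: a config-service key, a Redis entry, or, as here, a file mounted from a Kubernetes ConfigMap. Every name below (the flag path, the TTL, the fallback helper) is illustrative, not a description of any specific production system:

```python
import time
from pathlib import Path

# The flag lives outside the binary. Flipping the file's contents takes
# effect within one TTL: no deploy, no merge, no container restart.
FLAG_PATH = Path("/etc/flags/ai_path_enabled")  # illustrative path
_TTL = 5.0  # seconds between re-reads, so a flip lands fast
_cache = {"enabled": True, "checked": 0.0}

def ai_path_enabled() -> bool:
    now = time.monotonic()
    if now - _cache["checked"] > _TTL:
        try:
            _cache["enabled"] = FLAG_PATH.read_text().strip() == "true"
        except FileNotFoundError:
            _cache["enabled"] = True  # absent flag means the feature is on
        _cache["checked"] = now
    return _cache["enabled"]

def deterministic_fallback(query: str) -> str:
    # the pre-AI behavior: canned answer, plain search, graceful notice
    return "Smart answers are temporarily unavailable."

def call_model(query: str) -> str:
    return f"(model output for {query!r})"  # stand-in for the real LLM call

def answer(query: str) -> str:
    if not ai_path_enabled():
        return deterministic_fallback(query)
    return call_model(query)
```

The two properties that matter: the re-read interval is short enough that a flip propagates in seconds, and the fallback path is boring but correct.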

Your AI Feature Has No DRI: Why It's Drifting Without a Quarterly Goal Owner

· 11 min read
Tian Pan
Software Engineer

Walk into a quarterly business review and ask whose name is on the AI feature. Watch what happens. The PM points at the platform team. The platform team points at the research engineer who wrote the eval harness. The research engineer points at the FinOps analyst who keeps emailing about the cost graph. The FinOps analyst points back at the PM. Four people, one feature, zero owners. The eval score has been drifting downward for six weeks and nobody has triaged it because the dashboard lives in a Notion page that was last edited the day after launch.

This is the most predictable outcome of how organizations actually ship AI features in 2026. The feature was launched by a tiger team that got disbanded the moment the launch press release went out. The instrumentation was bolted on by an infra group that has no product mandate. The prompt is a prompts/v3.txt file in the repo whose blame is split across nine engineers, none of whom remember why line 47 says what it does. The user-facing tile has a PM whose OKRs moved on to the next launch two quarters ago. The feature is technically in production, technically owned, and structurally orphaned.

Your AI Feature Is Only As Reliable As The ETL Pipeline Nobody Owns

· 10 min read
Tian Pan
Software Engineer

The AI feature has the dashboard. The prompt has the version control. The eval suite has the on-call rotation. And then there is the upstream cron job, written in 2022, owned by a team that rotated out of analytics two reorgs ago, that produces the CSV your retrieval index is built from. That cron job has no SLA. That CSV has no schema contract. The team that owns it does not know it feeds an AI feature. When it changes — and it will change — the AI team will spend three weeks debugging a prompt that did nothing wrong.

The AI quality regression you are about to chase is almost never an AI problem. It is an ETL problem wearing an AI costume. The discipline that has to land is the seam between the two — the contract, the lineage, the freshness signal, the paired on-call — and the team that does not formalize it ships an AI feature whose reliability is bounded by the least-loved cron job in the company.
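The smallest version of that seam is a check that runs before every index rebuild and fails loudly on the AI team's side of the fence. A sketch, with the column names and the 24-hour freshness bound as illustrative placeholders:

```python
import csv
from datetime import datetime, timedelta, timezone
from pathlib import Path

EXPECTED_COLUMNS = ["account_id", "product", "description"]  # the contract
MAX_AGE = timedelta(hours=24)  # the freshness signal the cron job never had

def check_feed(path: Path) -> None:
    # freshness: a dead upstream cron should page before the index rebuilds
    age = datetime.now(timezone.utc) - datetime.fromtimestamp(
        path.stat().st_mtime, tz=timezone.utc
    )
    if age > MAX_AGE:
        raise RuntimeError(f"{path} is {age} old; upstream feed may be dead")
    # schema contract: a renamed or dropped column fails here, not in prod
    with path.open(newline="") as f:
        header = next(csv.reader(f))
    missing = set(EXPECTED_COLUMNS) - set(header)
    if missing:
        raise RuntimeError(f"{path} violates contract: missing {sorted(missing)}")
```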

AI Procurement Clauses Your Lawyers Haven't Learned to Ask For Yet

· 11 min read
Tian Pan
Software Engineer

The 14-month-old AI vendor contract on your shared drive was drafted from a SaaS template. It guarantees uptime, names a security contact, and caps liability at twelve months of fees. It says nothing about whether your prompts get fed into the next training run, what happens when the model you depend on is quietly swapped for a smaller variant, or which region your inference logs sit in when a regulator asks. The lawyer who drafted it did a competent job with the vocabulary they had. The vocabulary is a generation behind the surface area.

Procurement teams are still optimizing for the wrong contract. The standard MSA fights battles from the 2010s — outage credits, breach notification windows, indemnification for IP that makes it into the source repository. AI vendor relationships have a different attack surface, and the clauses that matter most are the ones that don't have a heading in your existing template. The team that lets last year's procurement playbook handle this year's vendor stack is signing away leverage they will need within a year.

The Autonomy Toggle: When Agent Mode Should Be a User Setting, Not a Model Setting

· 10 min read
Tian Pan
Software Engineer

The most expensive product decision in an agent product is invisible in the UI: somebody on the engineering team picked a single autonomy level and shipped it as a global default. The cautious user types out answers to three rounds of clarifying questions for a task they just wanted done. The power user closes the tab because every single step needs approval. Both look like product-market-fit problems. They are actually one design decision.

Autonomy is not a model property. It is a UX dimension — like notification frequency, display density, or default sort order — that different users want set differently for different tasks. Treating it as a hardcoded engineering choice forces a single point on a spectrum onto a user base that lives all along it. The fix is not a better default; the fix is exposing the dial.
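Concretely, the dial can be as small as an enum consulted at every step of the agent loop. This is a hypothetical three-level version; the level names and the notion of a "risky" step are illustrative:

```python
from enum import Enum

class Autonomy(Enum):
    ASK_FIRST = "ask_first"          # confirm before every action
    CONFIRM_RISKY = "confirm_risky"  # run safe steps, pause on risky ones
    FULL = "full"                    # run end to end, report when done

def should_pause(setting: Autonomy, step_is_risky: bool) -> bool:
    if setting is Autonomy.ASK_FIRST:
        return True
    if setting is Autonomy.CONFIRM_RISKY:
        return step_is_risky
    return False

# The dial is a per-user, per-task setting, not a constant in the agent loop:
user_setting = Autonomy.CONFIRM_RISKY
assert should_pause(user_setting, step_is_risky=True)
assert not should_pause(user_setting, step_is_risky=False)
```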

Bug Bashes for AI Features: Sampling a Distribution, Not Hunting Defects

· 11 min read
Tian Pan
Software Engineer

The classic bug bash is a deterministic ritual built for deterministic software. Ten engineers crowd a Slack channel for two hours, hammer a checklist of golden-path flows, and file tickets with crisp repro steps: "Click X, see Y, expected Z." It works because the system under test is reproducible — same input, same output, same bug, every time.

Run that exact ritual against an AI feature and you will produce two hundred tickets, close one hundred and eighty as "expected stochastic variation," and miss the twenty that signal a real cohort regression. The format isn't just stale; it's actively miscalibrated. A bug bash against an LLM-backed feature is not a defect-hunting session. It is a sampling exercise against a probability distribution, and the team that runs it like a deterministic test session is collecting noise and calling it signal.

This post is about how to redesign the bug bash for stochastic systems — what to change about the format, the participants, the triage rubric, and what counts as "done."
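One way to picture the shift: the bash harness samples each scenario repeatedly, aggregates per cohort, and files one ticket per out-of-line cohort instead of one per bad output. A toy sketch, where the sample size, failure threshold, and judge are all stand-ins:

```python
import random
from collections import defaultdict

def call_feature(prompt: str) -> str:
    # stand-in for the stochastic system under test
    return prompt if random.random() > 0.1 else "tomato soup"

def looks_correct(prompt: str, output: str) -> bool:
    return output == prompt  # stand-in for a rubric or an LLM judge

def run_bash(scenarios, samples=20, threshold=0.15):
    fails, totals = defaultdict(int), defaultdict(int)
    for cohort, prompt in scenarios:
        for _ in range(samples):
            totals[cohort] += 1
            if not looks_correct(prompt, call_feature(prompt)):
                fails[cohort] += 1
    # signal is a cohort failure rate, not an individual bad sample
    return {c: fails[c] / totals[c] for c in totals
            if fails[c] / totals[c] > threshold}

print(run_bash([("enterprise", "q1"), ("smb", "q2")]))
```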

The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

· 11 min read
Tian Pan
Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade — it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. One agent's "queue transferred" metric and the other's "task received" metric both went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.
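The cheapest guard is to make the routing history travel with the task and refuse any transfer that closes a loop. A minimal sketch, where the hop budget and task shape are illustrative assumptions:

```python
from dataclasses import dataclass, field

MAX_HOPS = 5  # illustrative budget; tune to your deepest legitimate route

@dataclass
class Task:
    payload: str
    route: list[str] = field(default_factory=list)  # agents seen so far

def transfer(task: Task, to_agent: str) -> Task:
    if to_agent in task.route:
        raise RuntimeError(
            f"routing cycle: {' -> '.join(task.route + [to_agent])}")
    if len(task.route) >= MAX_HOPS:
        raise RuntimeError(f"hop budget exhausted after {task.route}")
    task.route.append(to_agent)
    return task

t = Task("refund request")
transfer(t, "billing_agent")
transfer(t, "support_agent")
# transfer(t, "billing_agent")  # would raise instead of looping for 11 days
```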

Cohort-Aware Fine-Tuning: When One Model Isn't Enough But Per-User Is Too Much

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter shipped a fine-tuned model that beat their base by four points on their internal eval, then watched their top three customers churn over the following six weeks. The eval was fine. The aggregate was fine. The fine-tune just happened to win on the median user, who was a small-business buyer asking short factual questions, while silently regressing on the enterprise legal cohort whose long, citation-heavy queries had been the actual revenue driver. Nobody had sliced the eval by customer tier because nobody on the modeling side knew the customer tier mattered.

Most fine-tuning conversations live at one of two extremes. On one end, the "one fine-tune to rule them all" approach trains a single specialized model on a mix of all customer data and washes out the cohort-specific behavior that actually distinguished segments in the base model. On the other end, the "per-customer fine-tune" approach trains a separate adapter for each tenant, which is operationally tolerable below a hundred customers and falls apart somewhere around a few hundred. The interesting middle ground — where a small number of cohort-aware fine-tunes serve a segmented user base — is missing from most production playbooks.
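The mechanical core of that middle ground is tiny: a routing table from cohort to adapter, maintained next to eval slices for the same cohorts. The cohort keys and adapter ids below are invented for illustration:

```python
# cohort -> adapter, kept in lockstep with per-cohort eval slices
COHORT_ADAPTERS = {
    "smb_support": "adapter-smb-v2",         # short factual questions
    "enterprise_legal": "adapter-legal-v1",  # long, citation-heavy queries
}
BASE_MODEL = "base"  # anyone unsegmented gets the unmodified base

def adapter_for(customer: dict) -> str:
    return COHORT_ADAPTERS.get(customer.get("cohort", ""), BASE_MODEL)

assert adapter_for({"cohort": "enterprise_legal"}) == "adapter-legal-v1"
assert adapter_for({"cohort": "unknown"}) == "base"
```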

The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

· 10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.
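The generic fix is to pay those costs at container start and gate traffic on a readiness signal, rather than letting the first user pay them. A sketch of the shape, with sleeps standing in for the real loaders:

```python
import threading
import time

def load_tool_registry():    time.sleep(0.1)  # stands in for the 30s offender
def connect_vector_store():  time.sleep(0.1)  # first-connection negotiation
def prime_prompt_cache():    time.sleep(0.1)  # per-container cache fill
def validate_tool_schemas(): time.sleep(0.1)  # schema checks at boot, not first use

_ready = threading.Event()

def warm_up() -> None:
    load_tool_registry()
    connect_vector_store()
    prime_prompt_cache()
    validate_tool_schemas()
    _ready.set()

def readiness_probe() -> bool:
    # the load balancer keeps traffic away until the container is warm
    return _ready.is_set()

threading.Thread(target=warm_up, daemon=True).start()
```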

Your CS Team Built a Shadow Agent. That's Your Roadmap.

· 9 min read
Tian Pan
Software Engineer

A senior CSM in your support org spent a weekend wiring up an internal Slack bot. They wrote the system prompt themselves. They pointed it at the public docs, a Zendesk export of resolved tickets, and the changelog. Six weeks later it answers about 40% of the tier-1 questions their team used to type out by hand. Nobody on your engineering org chart knows it exists. The first time the platform team finds out, somebody from security will be asking why a service account is hitting Zendesk's API at 3 a.m.

The default reaction is panic. Lock down the API token. Send a company-wide email about unsanctioned AI. Add a slide to the next governance review. Then promise that the platform team will build "the official version" next quarter, on the proper roadmap.

That reaction misses what actually happened. The CS team didn't go rogue — they built a working prototype of a product the engineering team hasn't shipped. They have real usage data, real prompt iteration cycles, and real user feedback. Your platform roadmap has none of those. Treating the bot as a compliance violation throws away the most accurate prioritization signal your AI program is going to get this year.

The Eval Automation Trap: When Your Pipeline Drifts Away From What Users Actually Want

· 10 min read
Tian Pan
Software Engineer

Your eval pipeline scores are trending up. Response quality is improving. The LLM judge is catching more bad outputs. Your dashboard is green.

Meanwhile, a support ticket trickles in: "The assistant keeps giving me long, formal answers when I asked a simple question." Then another: "It stopped suggesting next steps. Used to do that automatically." Then your product manager shows you a chart: user satisfaction down 12% over the last quarter, correlated almost perfectly with the stretch where your automated eval metrics were climbing fastest.

This is the eval automation trap. Your measurement apparatus became optimized for itself rather than for what your users value — and because the feedback loop was entirely automated, nobody noticed until the damage was already in production.
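One escape hatch is to keep a small, continuously refreshed human-rated sample and alarm when the automated judge stops agreeing with it. A sketch, assuming Python 3.10+ for statistics.correlation; the 0.6 agreement floor is an arbitrary illustration:

```python
from statistics import correlation  # Python 3.10+

MIN_AGREEMENT = 0.6  # illustrative floor; calibrate to your own history

def judge_still_calibrated(judge_scores: list[float],
                           human_scores: list[float]) -> bool:
    # same outputs, scored independently by the LLM judge and by humans;
    # a falling correlation means the judge drifted, not the product improved
    return correlation(judge_scores, human_scores) >= MIN_AGREEMENT

judge = [0.9, 0.4, 0.8, 0.2]
human = [0.8, 0.5, 0.9, 0.1]
assert judge_still_calibrated(judge, human)
```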

The Fallback Cascade: Why Your AI Feature Needs Five Failure Modes, Not One

· 9 min read
Tian Pan
Software Engineer

Most AI features ship with exactly two states: working and broken. The model call succeeds and the feature responds; the model call fails and the user sees an error. This is the equivalent of building a web service with no load balancing, no cache, and a single database replica — technically functional until the moment it isn't.

The difference is that engineers learned database resilience patterns in the 1990s and have internalized them deeply. AI feature resilience is still being discovered the hard way, one production outage at a time. A payment processor lost $2.3M in a four-hour AI outage. A logistics company missed delivery windows for 30,000 packages when their routing model went down. Both failures shared a root cause: when the primary model was unavailable, there was nothing to fall back to.
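A cascade gives the feature more states than working and broken: primary model, cheaper fallback model, cached answer, static template, and finally an honest degraded message. A sketch with stubbed rungs so the fall-through is visible; the rung names and behaviors are illustrative:

```python
def primary_model(query):   raise TimeoutError("provider outage")
def secondary_model(query): raise TimeoutError("smaller fallback also down")
def cached_answer(query):   return None  # cache miss
def static_template(query): return f"Here is what we have on file for {query!r}."

def respond(query: str) -> str:
    for rung in (primary_model, secondary_model, cached_answer, static_template):
        try:
            result = rung(query)
        except Exception:
            continue  # degrade to the next, cheaper rung
        if result is not None:
            return result
    return "Suggestions are unavailable right now."  # the honest fifth state

print(respond("order 4412"))  # falls through to the static template
```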