4 posts tagged with "ai-platform"

AI Shadow IT: When Product Teams Build Their Own LLM Proxy

11 min read
Tian Pan
Software Engineer

The shadow IT incident your platform team is going to investigate in Q3 already happened in January. It looks like this: a senior engineer on a product team has a launch this month. The platform team's "official" LLM gateway is on the roadmap for "next quarter." So the engineer opens an OpenAI account on a corporate credit card, drops the API key into a .env file, ships the feature, and hits the public deadline. The launch is a success. Six months later, the FinOps team finds three vendor accounts nobody can attribute, the security team finds prompts containing customer data routed to a region not covered by the data processing agreement, and the platform team discovers the gateway it spent two quarters building has 14% adoption because every team that needed AI shipped without it.
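
For concreteness, the whole shadow integration is a few lines. A minimal sketch, assuming the `openai` and `python-dotenv` packages (the helper name and model id are illustrative, not from the post):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv
from openai import OpenAI       # pip install openai

load_dotenv()  # pulls OPENAI_API_KEY out of the local .env file

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def summarize_ticket(ticket_text: str) -> str:
    # Customer data goes straight to the vendor's default region:
    # no DPA check, no audit trail, no per-team cost attribution.
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative model id
        messages=[{"role": "user", "content": f"Summarize: {ticket_text}"}],
    )
    return resp.choices[0].message.content
```

Nothing in the sketch is wrong as code, which is exactly why it ships on deadline; everything the gateway would have added (attribution, regional routing, central billing) is simply absent.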

This is not a security failure or a discipline failure. It is a platform-product velocity mismatch, and treating it as anything else guarantees the next gateway you ship will have the same adoption problem.

Model Rollback Velocity: The Seven-Hour Gap Between 'This Upgrade Is Wrong' and 'Old Model Fully Restored'

12 min read
Tian Pan
Software Engineer

The playbook for a bad code deploy is a sub-minute revert. The playbook for a bad config push is a sub-second flag flip. The playbook for a bad model upgrade is whatever the on-call invents at 09:14, and on a typical day it takes seven hours to finish. During those seven hours the regression keeps compounding — wrong answers ship to customers, support tickets pile up, and the dashboard shows a slow gradient rather than a clean cliff back to green.

The reason the gap is seven hours is not that the team is slow. It is that "rollback" for a model upgrade is not the same primitive as "rollback" for code. It is closer to a database schema migration: partial, hysteretic, and not reversible by pressing the button you wish existed. The team that wrote its incident playbook around a button does not have the controls the actual rollback requires.

This post is about what those controls look like, why they have to be paid for in advance, and what you find out about your platform the first time you try to roll back a model under load.
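
To make the missing primitive concrete, here is a minimal sketch of the smallest useful control: a model id pinned in config that the service re-reads per request, so reverting is a config push rather than a redeploy. The file path and names are illustrative assumptions, not the post's actual design:

```python
import json
import pathlib

# Illustrative config location; in practice this would be a config
# service, not a flat file. Example contents:
#   {"support-summarizer": "gpt-4o-2024-05-13"}
PIN_FILE = pathlib.Path("/etc/llm/route_pins.json")

def pinned_model(route: str) -> str:
    # Re-read on every request: rolling back means pushing one config
    # change and watching traffic shift, not redeploying the service.
    pins = json.loads(PIN_FILE.read_text())
    return pins[route]

# Even with the pin, the config flip does not undo:
#   - responses the new model already cached or wrote downstream
#   - prompts and few-shot examples retuned against the new model
#   - thresholds and evals recalibrated during the upgrade window
# That residue is the hysteresis described above.
```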

JSON Mode Is a Dialect, Not a Standard: The Silent Breakage in Your Fallback Path

11 min read
Tian Pan
Software Engineer

The first time I watched a fallback router cause a worse incident than the outage it was trying to mitigate, the postmortem document had a header that read: "Primary degraded for 11 minutes. Fallback degraded our parser for 6 days." Nobody had written the code wrong. Nobody had skipped the schema review. The integration tests against the secondary provider had been green when the fallback was wired up, eighteen months earlier. What had happened in between was that one of the two providers had quietly tightened its enum coercion policy, and the contract our downstream parsers had been written against — a contract we believed was "JSON Schema, more or less" — had drifted from a shared standard into two slightly incompatible dialects.

This is the failure mode I keep seeing, and it keeps surprising teams that should know better. "JSON mode" sounds like a feature you turn on. It is not. It is a contract you maintain — separately, against every provider you might route to — and the contract drifts every quarter as vendors evolve their structured-output stacks. The "drop-in replacement" your provider docs gestured at when you signed the contract is, in production, a maintained translation layer whose absence converts your fallback path into a paper compliance artifact: present in the architecture diagram, broken on the day you needed it.
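
What a maintained translation layer looks like in miniature: one per-provider normalizer plus one shared schema gate in front of every downstream parser. A sketch assuming the `jsonschema` package; the contract, provider names, and the title-cased-enum drift are illustrative:

```python
import json

from jsonschema import validate  # pip install jsonschema

# The shared contract every downstream parser is written against.
CONTRACT = {
    "type": "object",
    "properties": {"priority": {"enum": ["low", "medium", "high"]}},
    "required": ["priority"],
}

def normalize_secondary(doc: dict) -> dict:
    # Illustrative drift: the secondary provider started emitting
    # title-cased enum values ("High") that the primary never did.
    if isinstance(doc.get("priority"), str):
        doc["priority"] = doc["priority"].lower()
    return doc

def parse(provider: str, raw: str) -> dict:
    doc = json.loads(raw)
    if provider == "secondary":
        doc = normalize_secondary(doc)
    # One shared gate: if a dialect drifts past the normalizer, fail
    # loudly at the boundary instead of six days later downstream.
    validate(instance=doc, schema=CONTRACT)
    return doc

print(parse("secondary", '{"priority": "High"}'))  # {'priority': 'high'}
```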

Token Budgets Are the New Internal IAM

11 min read
Tian Pan
Software Engineer

The first time your AI bill clears seven figures in a month, the budget meeting changes shape. Until then, the question is "can we afford this?" After that, the question is "who gets how much?" — and most engineering orgs discover, in real time, that they have no policy framework for answering it. The team that shipped the loudest demo holds the highest quota by accident. Finance pushes for flat per-headcount caps that starve the team doing the highest-leverage work. Security gets cut out of the conversation entirely until somebody notices that the eval team has been pulling production traffic through their personal token allowance for six months.

The reason this conversation always feels like a cloud-cost argument is that it almost is one — but not quite. With cloud, the unit of waste is a forgotten EC2 instance and the worst case is a 3x bill. With token quotas, the unit of waste is a runaway agent loop, and the unit of access is a user-facing capability: whoever holds the budget can ship the feature. That second property is what makes token allocation rhyme with capability-based security instead of with cloud FinOps. The quota is not just a spending cap. It is the right to make a class of inferences happen.
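
A sketch of that framing, with all names illustrative: the quota is modeled as a capability object, so the IAM question (may this team make this class of inference at all?) and the FinOps question (is there budget left?) are answered at the same checkpoint:

```python
from dataclasses import dataclass

@dataclass
class TokenGrant:
    team: str
    purpose: str         # the class of inference this grant authorizes
    monthly_tokens: int  # the cap Finance argues about
    used: int = 0

    def authorize(self, tokens: int) -> bool:
        # Access control and spend control meet in one object: a request
        # is allowed only if a grant exists and has budget remaining.
        if self.used + tokens > self.monthly_tokens:
            return False
        self.used += tokens
        return True

grants = {
    ("evals", "offline-eval"): TokenGrant("evals", "offline-eval", 50_000_000),
}

def route_request(team: str, purpose: str, est_tokens: int) -> bool:
    # No grant for this purpose means no path to a model at all,
    # which is what makes the budget a capability, not just a cap.
    grant = grants.get((team, purpose))
    return grant is not None and grant.authorize(est_tokens)
```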