
8 posts tagged with "ai-platform"


Bring-Your-Own-Key for AI Features: The Sales-Driven Re-Architecture Nobody Costed

· 10 min read
Tian Pan
Software Engineer

The procurement team you're selling to will eventually ask the one question that resets your architecture: "Can we bring our own model API key?" Saying yes wins the deal. Saying yes also moves your trust boundary, your cost boundary, and your operational boundary at the same time — and most product teams discover this only after the contract is signed and the first month of usage produces a support ticket nobody knows how to answer.

BYOK is sold internally as a toggle. The customer pastes a key, your code reads it from the vault instead of from your own account, and inference flows the same way it always did. It is not a toggle. It is a sales-driven re-architecture that ripples through cost attribution, security incident response, observability, rate limiting, model-version pinning, and on-call accountability. The teams that ship it without acknowledging this end up rebuilding their entire platform layer a year later while a paying enterprise customer waits for fixes.
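To make the "it is not a toggle" point concrete, here is a minimal sketch of what key resolution has to carry once BYOK exists. It assumes customer keys are available as a plain mapping standing in for the vault; `ResolvedKey`, `resolve_key`, and `usage_record` are hypothetical names, not part of any real SDK. The key lookup is the easy part. The `owner` field is what every downstream system (billing, quota, on-call) now has to branch on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResolvedKey:
    api_key: str
    owner: str  # "customer" or "platform": decides who pays and whose quota is used


def resolve_key(tenant_id: str, customer_keys: dict[str, str], platform_key: str) -> ResolvedKey:
    """Pick the customer's own key when one is stored, else fall back to the platform key.

    customer_keys is a stand-in for a vault lookup (an assumption for this sketch).
    """
    key = customer_keys.get(tenant_id)
    if key is not None:
        return ResolvedKey(api_key=key, owner="customer")
    return ResolvedKey(api_key=platform_key, owner="platform")


def usage_record(tenant_id: str, resolved: ResolvedKey, tokens: int) -> dict:
    """Build a usage event that never contains the key itself, only who owns it."""
    return {
        "tenant_id": tenant_id,
        "billed_to": resolved.owner,
        "tokens": tokens,
        # Only platform-owned keys consume the platform's provider quota.
        "counts_against_platform_quota": resolved.owner == "platform",
    }
```

Even in this toy form, the usage record has two shapes depending on the key owner, which is the start of the cost-attribution and rate-limiting split described above.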

Prompt Portfolios: Manage a Basket, Not a Single Best Prompt

· 10 min read
Tian Pan
Software Engineer

Most production AI teams talk about prompts the way junior traders talk about stocks: there is one best one, and the job is to find it. So they iterate — a Slack thread, a few eval rows, a new winner, push to main, repeat. The result is a single artifact carrying the entire intent-resolution surface of the product, optimized against a frozen evaluation set, sitting one regrettable edit away from a P1.

The mistake is the singular. A prompt is not a security; it is an allocation. The same user intent can be served well by several variants, each with its own confidence interval, its own per-segment performance, and its own sensitivity to model and corpus drift. The right mental model is not "find the best prompt" — it is "manage a basket of prompts whose composition is itself the product." Quantitative finance figured this out fifty years ago, and the operational machinery transfers almost without modification.
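One way to picture "a basket of prompts" is as a set of allocation weights that traffic is routed against. The sketch below is an illustration only: `pick_variant` and the variant names are hypothetical, and hashing the user ID is just one possible way to keep each user on a stable variant while the weights express the portfolio's composition.

```python
import hashlib


def pick_variant(user_id: str, weights: dict[str, float]) -> str:
    """Deterministically map a user to a prompt variant in proportion to its weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one variant needs a positive weight")
    # Hash the user ID to a stable point in [0, total).
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64 * total
    cumulative = 0.0
    for name, weight in sorted(weights.items()):
        cumulative += weight
        if point < cumulative:
            return name
    return name  # guards against floating-point rounding at the upper edge


# Hypothetical allocation: rebalancing the basket means editing these weights.
portfolio = {"concise_v3": 0.6, "stepwise_v1": 0.3, "fallback_v2": 0.1}
variant = pick_variant("user-1842", portfolio)
```

Rebalancing then becomes a change to the weights rather than a replacement of the single prompt in production.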

The Platform-Readiness Gap: When AI Features Ship Before the Infra to Operate Them

· 11 min read
Tian Pan
Software Engineer

The launch is not the moment an AI feature ships. It is the moment the platform team inherits a production system they had no chance to design.

A product team prototypes a feature. The demo lands well with the executive team. A launch date gets set. And somewhere between the slide deck and the rollout, the feature ships into production before anyone built the eval harness, the prompt registry, the routing layer, the cost dashboards, the rollback primitive, the on-call rotation that knows what an agent looks like, or the secrets-rotation policy for the new vendor's API keys. The feature works. The demo metrics are green. The platform team is now on the hook for an operational system whose primitives don't exist yet.

This is the platform-readiness gap, and it is the single most common reason AI programs that look healthy at launch become unmanageable by the fifth feature.

The Inter-Team Token Budget War: When Your AI Platform Team Becomes a Treasury Department

· 10 min read
Tian Pan
Software Engineer

The team that built your internal LLM gateway scoped it for "rate limiting and audit." Eighteen months later, the same team is running a quarterly allocation meeting, mediating a quota dispute between two product groups, and discovering that the architecture they shipped to solve a capacity problem now functions as the company's internal AI treasury. Nobody chartered them for this role. Nobody removed it from their plate either.

This is the trajectory every AI platform team is on, and most of them get to the political economy phase before they have a policy, a sponsor, or even the telemetry to defend a decision. The technical work — request routing, key management, retries — is the easy half. The hard half is that finite provider quota plus three product teams with launch deadlines is a budget allocation system, and the team running the gateway is the one being asked to allocate.
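The claim that finite quota plus competing teams is a budget allocation system can be shown with a few lines of arithmetic. This is a sketch under a deliberately simple assumption, proportional sharing of a tokens-per-minute limit; `allocate_quota` is a hypothetical helper, and a real policy would likely add floors, priorities, or a sponsor's override.

```python
def allocate_quota(total_tpm: int, requests: dict[str, int]) -> dict[str, int]:
    """Split a provider's tokens-per-minute limit across teams.

    If total demand fits, every team gets what it asked for. Otherwise each team
    gets a share proportional to its request (rounded down, so the sum never
    exceeds total_tpm).
    """
    demand = sum(requests.values())
    if demand <= total_tpm:
        return dict(requests)
    return {team: total_tpm * asked // demand for team, asked in requests.items()}


# Three teams with launch deadlines ask for more than the provider allows.
grants = allocate_quota(
    total_tpm=1_000_000,
    requests={"search": 600_000, "support": 600_000, "growth": 800_000},
)
# grants == {"search": 300000, "support": 300000, "growth": 400000}
```

Note that proportional sharing rewards whoever asks for the most, which is exactly the kind of policy choice the gateway team ends up owning by default.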

AI Shadow IT: When Product Teams Build Their Own LLM Proxy

· 11 min read
Tian Pan
Software Engineer

The shadow IT incident your platform team is going to investigate in Q3 already happened in January. It looks like this: a senior engineer on a product team has a launch this month. The platform team's "official" LLM gateway is on the roadmap for "next quarter." So the engineer opens an OpenAI account on a corporate credit card, drops the API key into a .env file, ships the feature, and hits the public deadline. The launch is a success. Six months later, the FinOps team finds three vendor accounts nobody can attribute, the security team finds prompts containing customer data routed to a region not covered by the data processing agreement, and the platform team discovers the gateway it spent two quarters building has 14% adoption because every team that needed AI shipped without it.

This is not a security failure or a discipline failure. It is a platform-product velocity mismatch, and treating it as anything else guarantees the next gateway you ship will have the same adoption problem.

Model Rollback Velocity: The Seven-Hour Gap Between 'This Upgrade Is Wrong' and 'Old Model Fully Restored'

· 12 min read
Tian Pan
Software Engineer

The playbook for a bad code deploy is a sub-minute revert. The playbook for a bad config push is a sub-second flag flip. The playbook for a bad model upgrade is whatever the on-call invents at 09:14, and on a typical day it takes seven hours to finish. During those seven hours the regression keeps compounding — wrong answers ship to customers, support tickets pile up, and the dashboard shows a slow gradient rather than a clean cliff back to green.

The reason the gap is seven hours is not that the team is slow. It is that "rollback" for a model upgrade is not the same primitive as "rollback" for code. It is closer to a database schema migration: partial, hysteretic, and not reversible by pressing the button you wish existed. The team that wrote its incident playbook around a button does not have the controls the actual rollback requires.

This post is about what those controls look like, why they have to be paid for in advance, and what you find out about your platform the first time you try to roll back a model under load.

JSON Mode Is a Dialect, Not a Standard: The Silent Breakage in Your Fallback Path

· 11 min read
Tian Pan
Software Engineer

The first time I watched a fallback router cause a worse incident than the outage it was trying to mitigate, the postmortem document had a header that read: "Primary degraded for 11 minutes. Fallback degraded our parser for 6 days." Nobody had written bad code. Nobody had skipped the schema review. The integration tests against the secondary provider had been green when the fallback was wired up, eighteen months earlier. In the interim, one of the two providers had quietly tightened its enum coercion policy, and the contract our downstream parsers had been written against (a contract we believed was "JSON Schema, more or less") had drifted from a shared standard into two slightly incompatible dialects.

This is the failure mode I keep seeing, and it keeps surprising teams that should know better. "JSON mode" sounds like a feature you turn on. It is not. It is a contract you maintain — separately, against every provider you might route to — and the contract drifts every quarter as vendors evolve their structured-output stacks. The "drop-in replacement" your provider docs gestured at when you signed the contract is, in production, a maintained translation layer whose absence converts your fallback path into a paper compliance artifact: present in the architecture diagram, broken on the day you needed it.
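A maintained contract can start as small as a check that runs against every provider's output before the downstream parser sees it. The sketch below assumes a hypothetical `priority` enum and a hypothetical dialect difference (casing and whitespace); the provider names and the `parse_ticket` helper are illustrative, not any vendor's documented behavior.

```python
import json

# Hypothetical enum the downstream parser was written against.
ALLOWED_PRIORITY = {"low", "medium", "high"}


def parse_ticket(raw: str, provider: str) -> dict:
    """Parse a structured-output payload and enforce the enum contract explicitly.

    Raises ValueError with the provider name so that drift shows up as a
    named, attributable failure rather than a silent downstream parse error.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError(f"{provider}: expected a JSON object")
    priority = data.get("priority")
    if not isinstance(priority, str):
        raise ValueError(f"{provider}: 'priority' is missing or not a string")
    # Assumed dialect difference: one provider may return a different casing.
    normalized = priority.strip().lower()
    if normalized not in ALLOWED_PRIORITY:
        raise ValueError(f"{provider}: priority {priority!r} is outside the contract")
    return {**data, "priority": normalized}
```

Running the same sample prompts through this check for each provider on a schedule, not only when the fallback is first wired up, is what turns the fallback from a diagram entry into something exercised.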

Token Budgets Are the New Internal IAM

· 11 min read
Tian Pan
Software Engineer

The first time your AI bill clears seven figures in a month, the budget meeting changes shape. Until then, the question is "can we afford this." After that, the question is "who gets how much" — and most engineering orgs discover, in real time, that they have no policy framework for answering it. The team that shipped the loudest demo holds the highest quota by accident. Finance pushes for flat per-headcount caps that starve the team doing the highest-leverage work. Security gets cut out of the conversation entirely until somebody notices that the eval team has been pulling production traffic through their personal token allowance for six months.

The reason this conversation always feels like a cloud-cost argument is that it almost is one — but not quite. With cloud, the unit of waste is a forgotten EC2 instance and the worst case is a 3x bill. With token quotas, the unit of waste is a runaway agent loop, and the unit of access is a user-facing capability: whoever holds the budget can ship the feature. That second property is what makes token allocation rhyme with capability-based security instead of with cloud FinOps. The quota is not just a spending cap. It is the right to make a class of inferences happen.
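The difference between a spending cap and a capability can be sketched in a few lines. `TokenGrant` below is a hypothetical structure, not an existing library type: it refuses a request both when the budget is exhausted and when the request falls outside the class of inference the grant was issued for.

```python
from dataclasses import dataclass


@dataclass
class TokenGrant:
    holder: str        # team or service the grant was issued to
    capability: str    # class of inference this grant authorizes (hypothetical label)
    remaining: int     # tokens left in the current period

    def spend(self, capability: str, tokens: int) -> bool:
        """Deduct tokens only if the request matches the granted capability."""
        if capability != self.capability:
            return False  # wrong class of inference: an access decision, not a cost one
        if tokens < 0 or tokens > self.remaining:
            return False  # over budget (or nonsensical request)
        self.remaining -= tokens
        return True


grant = TokenGrant(holder="eval-team", capability="offline-eval", remaining=10_000)
ok = grant.spend("offline-eval", 4_000)       # True: in scope and within budget
denied = grant.spend("production-chat", 100)  # False: budget remains, capability does not match
```

The second call is the one a pure FinOps cap would have allowed, which is why the eval-team scenario above looks like an access-control gap rather than a cost overrun.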