20 posts tagged with "platform-engineering"

The AI Gateway Is the SPOF Nobody Named

· 10 min read
Tian Pan
Software Engineer

The pitch sounded responsible. "Let's not hardcode OpenAI everywhere — we'll put a thin abstraction in front, then we can swap providers if we need to." Two years later, that thin abstraction is a service with its own deploy pipeline, its own SRE on-call, an eval gate that blocks bad prompts, a semantic cache that saves seven figures a year, a retry policy with provider-specific backoffs, an observability schema every dashboard depends on, and a key vault holding the credentials for six model vendors. Every AI feature in the company terminates there.

It is also, almost by accident, the single point of failure with the worst blast radius in the stack. When the primary LLM provider goes down (and in 2025 OpenAI was tracked at 294 outage events since January, while Anthropic logged 184.5 hours of total customer impact in December alone), the gateway routes around it and most users never notice. When the gateway itself dies, every AI feature in every product stops at once, the failover that was supposed to fire never gets a chance, and the postmortem opens with "the abstraction layer we built to insulate us from provider outages was the outage."
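To make the asymmetry concrete, here is a minimal sketch of the kind of provider failover loop a gateway might run. The `call_with_failover` helper, the `ProviderDown` exception, and the two stand-in clients are hypothetical names invented for illustration; they do not come from any real gateway or vendor SDK.

```python
from typing import Callable, Sequence


class ProviderDown(Exception):
    """Raised by a (hypothetical) provider client when the vendor is unavailable."""


def call_with_failover(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, completion)."""
    errors = []
    for name, client in providers:
        try:
            return name, client(prompt)
        except ProviderDown as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in clients for illustration only.
def primary(prompt: str) -> str:
    raise ProviderDown("503 from vendor")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


name, completion = call_with_failover(
    "hi", [("primary", primary), ("secondary", secondary)]
)
```

The loop only protects callers while the process running it is alive. If this logic lives solely inside the gateway, a gateway outage takes the failover down with it, which is the failure the excerpt describes.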

The Internal-Tooling Agent: When Your Highest-Leverage AI Feature Has Zero Customers

· 10 min read
Tian Pan
Software Engineer

The most strategic AI investment in your company is probably a Slack bot one engineer built on a Friday afternoon. It answers "how do I get a staging credential" or "which on-call is responsible for the auth service" or "what's the runbook for a stuck deploy," and it has saved more engineering hours than the entire customer-facing AI roadmap that absorbs three quarters of your model spend, your safety review queue, and your launch comm bandwidth.

The org chart doesn't reflect this. The OKR doc doesn't reflect this. Nobody is the PM. Nobody is the EM. The bot survives because the engineer who built it still answers the GitHub issues, and the value compounds quietly while every customer-facing feature ships behind a six-week safety review and a launch readiness checklist that exists because the customer might churn.

We Already Have That: When AI Features Reinvent Code You Already Own

· 11 min read
Tian Pan
Software Engineer

A team I worked with shipped a "smart" date extractor last quarter. The model parsed natural-language phrases like "next Tuesday" and "two weeks from the 14th," ran in production behind a feature flag, and cost about three cents per request at the chosen tier. Six weeks later, a backend engineer wandered into a design review and mentioned, casually, that the company already had a date parser. It had been written in 2019, lived in a utility module nobody on the AI team had read, handled 99.4% of the same inputs at sub-millisecond latency, and ran for free. The AI feature did not get pulled. It got rationalized — "the model handles the long tail" — and the team moved on, having shipped a more expensive, slower, less accurate version of something the company already owned.

This is not a one-off story. It is the dominant failure mode for AI features inside companies older than the AI team. The pattern repeats: a smart classifier duplicates a regex pipeline written years ago, a retrieval system fetches a vendor list that an internal service has been maintaining as a typed table, an agent learns to extract entities a parser already extracts deterministically. The AI feature ships with a quality bar lower than the deterministic system it didn't know existed, and the team who built the deterministic system finds out at a cross-team meeting.

The Agent Portfolio Audit: How to Consolidate 15 Independent Agents Into a Platform Without Killing Team Autonomy

· 9 min read
Tian Pan
Software Engineer

Six months after launching their first AI agent, most engineering organizations discover they have fifteen of them. Not because anyone planned a fleet — because each team solved a real problem and shipped. The customer support team built a triage agent. The data team built a report-generation agent. Platform engineering built a runbook agent. Infrastructure built three more. None of them share auth, logging, tooling, or evaluation methodology. Tokens are bleeding from a dozen provider accounts and nobody can tell you which agent is responsible.

This is the moment that separates engineering organizations that can scale AI from those that can't. The answer is not to slow down agent development — it's to run a portfolio audit before entropy makes consolidation impossible.

Golden Paths for AI Agents: How Platform Teams Can Enable Adoption Without Becoming a Bottleneck

· 11 min read
Tian Pan
Software Engineer

The most common failure mode for AI platform teams isn't technical. It's organizational: the central platform team becomes a gate that every product team must pass through to get any AI capability into production. The request queue grows. Cycle times balloon from days to weeks. Product teams get frustrated and start stitching together unofficial workarounds: hardcoded API keys, shadow LLM integrations, vendor accounts on personal credit cards. By the time the platform team notices, half the organization is running AI outside any governance structure.

The problem isn't that platform teams care about governance. It's that they implemented governance as an approval workflow instead of as infrastructure.

AI Ops Is Not Platform Engineering: How Running LLM Services Breaks Your SRE Playbook

· 10 min read
Tian Pan
Software Engineer

Your SRE team is excellent at running microservices. They've mastered blue-green deployments, canary rollouts, distributed tracing, SLO burn-rate alerts, and postmortem culture. Then someone ships an LLM-powered feature, and within a week an incident happens that none of those practices were designed to handle: the model starts generating plausible-sounding but structurally wrong outputs, no error is logged, no health check fails, and users silently get garbage for four hours before anyone notices.

This isn't a skills gap. It's an architectural gap. Running LLM services is a distinct operational discipline from running microservices, and the practices that don't transfer will burn your team if you don't identify them explicitly.
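One way to give that silent failure a signal is to validate the structure of every model response and track the failure rate like any other SLI. The sketch below assumes responses are supposed to be JSON objects; the `REQUIRED_FIELDS` schema and both function names are hypothetical.

```python
import json

# Hypothetical contract for a model response: field name -> expected type.
REQUIRED_FIELDS = {"intent": str, "confidence": float}


def structural_errors(raw: str) -> list[str]:
    """Return a list of structural problems; an empty list means the output is well-formed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}")
    return errors


def failure_rate(outputs: list[str]) -> float:
    """Fraction of outputs that fail the structural check (0.0 for an empty batch)."""
    if not outputs:
        return 0.0
    bad = sum(1 for out in outputs if structural_errors(out))
    return bad / len(outputs)
```

A check like this catches only malformed structure, not wrong answers, but it turns "no error is logged" into a number an alert can fire on.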

The Shadow AI Problem: Why Engineers Bypass Your Official AI Platform and What to Do About It

· 9 min read
Tian Pan
Software Engineer

Your data governance audit probably found them: API keys for OpenAI and Anthropic billed to personal credit cards, Slack bots wired to Claude through personal accounts, local Ollama instances proxying requests through the corporate VPN. Nobody told platform engineering. Nobody asked IT. The engineers just... did it.

This is the shadow AI problem, and it is already inside your organization whether you have detected it yet or not. Roughly half of employees in knowledge-work environments report using AI tools that their employers have not sanctioned. Among software engineers — who have the technical skill to set up unofficial integrations and the productivity pressure to want them — that number is almost certainly higher.

The instinct of most security and platform teams is to respond with prohibition: block the endpoints, restrict the API keys, add AI tool requests to the procurement queue. That response reliably produces more shadow AI, not less, because it treats a platform design failure as a compliance failure.

Your CS Team Built a Shadow Agent. That's Your Roadmap.

· 9 min read
Tian Pan
Software Engineer

A senior CSM in your support org spent a weekend wiring up an internal Slack bot. They wrote the system prompt themselves. They pointed it at the public docs, a Zendesk export of resolved tickets, and the changelog. Six weeks later it answers about 40% of the tier-1 questions their team used to type out by hand. Nobody on your engineering org chart knows it exists. The first time the platform team finds out, somebody from security will be asking why a service account is hitting Zendesk's API at 3am.

The default reaction is panic. Lock down the API token. Send a company-wide email about unsanctioned AI. Add a slide to the next governance review. Then promise that the platform team will build "the official version" next quarter, on the proper roadmap.

That reaction misses what actually happened. The CS team didn't go rogue — they built a working prototype of a product the engineering team hasn't shipped. They have real usage data, real prompt iteration cycles, and real user feedback. Your platform roadmap has none of those. Treating the bot as a compliance violation throws away the most accurate prioritization signal your AI program is going to get this year.

The Hidden Edges Between Your AI Features: When One Prompt Edit Regresses Three Other Teams

· 9 min read
Tian Pan
Software Engineer

A platform engineer changes the opening sentence of the company's "house style" preamble — a single line that anchors voice across customer-facing assistants. The change ships behind a flag. By Tuesday, the search team's relevance regression has spiked, the support bot's eval pass-rate has dropped four points, and the onboarding agent's retry rate has doubled. None of those teams touched their own code. None of them got a heads-up. The platform engineer has no idea any of this happened, because nobody was on the receiving end of an alert that said "your edit just broke three downstream features."

This is the failure mode that defines the second year of an AI org's life. The first year, every team builds its own thing in a corner. The second year, those corners start sharing artifacts — a prompt fragment here, a seeded eval set there, a tool schema reused as a contract — and the moment that sharing becomes implicit, the dependency graph between AI features becomes invisible. You now have a distributed system whose edges no one can name.

The discipline that fixes this is not a new platform. It's drawing the graph.
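As a rough sketch of what "drawing the graph" could look like, the snippet below records which artifacts consume which shared artifacts and walks the graph to list everything downstream of an edit. The artifact names and the `downstream` helper are made up for illustration, loosely following the preamble example above.

```python
from collections import deque

# Hypothetical edges: shared artifact -> artifacts that directly consume it.
EDGES = {
    "house_style_preamble": [
        "search_relevance_prompt",
        "support_bot_prompt",
        "onboarding_agent_prompt",
    ],
    "support_bot_prompt": ["support_bot_eval_set"],
}


def downstream(changed: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first list of every artifact that transitively depends on `changed`."""
    seen = {changed}  # guards against cycles leading back to the start
    order = []
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in edges.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


affected = downstream("house_style_preamble", EDGES)
```

With the edges written down, a change to the preamble can notify the owners of everything in `affected` before the flag flips, rather than after their metrics move.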

AI Office Hours Don't Scale: When Your One Expert Becomes the Release Gate

· 11 min read
Tian Pan
Software Engineer

Open the calendar of the one engineer at your company who has shipped real AI features into production for more than six months. Count the recurring "30 min sync — questions about the agent" invites, the ad-hoc "can I grab you for 15?" Slack pings that ended up booked, the architecture-review attendances marked "optional" that they actually have to be at, and the office hours block that started as one Friday afternoon and now eats two hours every weekday. Then look at the roadmap and trace which features depend on a decision that engineer hasn't made yet. The intersection is your real release schedule. The Jira board is fiction.

This is the AI office hours bottleneck, and it is the load-bearing constraint inside more 2026 AI orgs than anyone in those orgs would say out loud. The team scaled AI feature work fast — every product squad got a model budget, every PM got a prompt — and routed every "is this the right model," "should we use RAG here," "is our eval design valid," "why is the cache hit rate weird" question to the one engineer who's actually shipped enough production AI to answer. Six months in, that engineer's calendar is the rate-limiting reagent for half the roadmap, and "I need to grab 30 minutes with them" is the load-bearing escalation path your incident response was supposed to make explicit.

The Internal LLM Gateway Is the New Service Mesh

· 10 min read
Tian Pan
Software Engineer

Walk into any company with fifty engineers writing LLM code in production and you will find seven gateway-shaped artifacts. The recommendations team built one to route between OpenAI and Anthropic. The support-bot team wrote one to attach their prompt registry. The platform team has a half-finished proxy that handles auth but not rate limiting. The growth team has a Lambda that does PII redaction on its way out. The data-science team is calling the vendor SDK directly and nobody has told them to stop. There is no shared gateway. There are seven shared problems, each solved poorly in isolation, and a CFO who is about to ask why the AI bill grew 40% quarter over quarter with no clear owner for any of it.

This is the same architectural beat the industry hit with microservices in 2016 and 2017. A thousand external dependencies, the same shared concerns on every team (auth, retries, observability, policy), and a choice between solving them once or rediscovering them everywhere. The answer then was the service mesh. The answer now is the internal LLM gateway, and most companies are still in the rediscovering-everywhere phase.

The Model Deprecation Treadmill: Discipline That Has to Exist Before the Sunset Email

· 13 min read
Tian Pan
Software Engineer

The team that treats "we use the latest model" as a virtue is one sunset email away from a quarter of unplanned work. By the time the deprecation notice lands, the architectural decisions that determine whether you can absorb it have already been made — months ago, by people who weren't thinking about migrations at all. The eval suite was implicitly trained against a specific checkpoint. The prompts were tuned against a specific refusal style. The cost projections assumed a specific token-per-task baseline. The router has a hardcoded fallback to a model that is itself about to disappear. None of these decisions look like risks until the email arrives, and then all of them look like the same risk.

Model deprecation is now the most predictable surprise in the AI stack. Anthropic gives a minimum of 60 days' notice on publicly released models. OpenAI's notice windows range from three months for specialized snapshots to 18 months for foundational models, but in practice a recent batch of ChatGPT model retirements landed with as little as two weeks' warning for some teams. GitHub deprecated a slate of Anthropic and OpenAI models in February 2026 in a single coordinated changelog entry. The pattern is no longer "if a model retires" — it's "every quarter, at least one model your stack depends on enters a retirement window, and the calendar isn't synchronized to your roadmap."
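A small illustration of the discipline: keep a registry of announced retirement dates and scan every route, including fallbacks, for models inside the notice window. The model names, dates, and route table below are invented for the example; the 60-day default mirrors the minimum notice period mentioned above.

```python
from datetime import date

# Hypothetical registry of announced retirement dates.
RETIREMENTS = {
    "model-a": date(2026, 3, 1),
    "model-b": date(2026, 9, 1),
}

# Hypothetical router config: route -> slot -> model.
ROUTES = {
    "summarize": {"primary": "model-b", "fallback": "model-a"},
}


def at_risk(routes, retirements, today, window_days=60):
    """Return (route, slot, model, days_left) for models retiring within the window."""
    findings = []
    for route, slots in routes.items():
        for slot, model in slots.items():
            retire_on = retirements.get(model)
            if retire_on is None:
                continue  # no announced retirement for this model
            days_left = (retire_on - today).days
            if days_left <= window_days:
                findings.append((route, slot, model, days_left))
    return findings


findings = at_risk(ROUTES, RETIREMENTS, today=date(2026, 1, 15))
```

In this made-up config the primary model is fine and the fallback is the one about to disappear, the same trap as the hardcoded fallback in the excerpt. Running a scan like this in CI would surface it before the sunset email does.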