
24 posts tagged with "platform-engineering"


The Agent Memory Store That Survived Your Tenant Deletion Because Nobody Owned It

· 10 min read
Tian Pan
Software Engineer

A compliance program is a description of the systems your company had on the day the auditor signed off. The systems your company has today are a different set, and the gap is the surface area of every release that shipped a new persistent store between then and now. The deletion guarantee you sold your customers is a guarantee against the first set, and the regulator who eventually asks about it will be asking about the second.

The failure mode is not a bug in the deletion code. The deletion code is correct. The saga fans out across every storage system named in the data inventory, calls each one's erasure endpoint, collects a receipt per system, and reports success when every receipt comes back signed. The saga is doing exactly what it was built to do. The problem is that the saga is iterating over a list of storage systems that was true eighteen months ago, and the agent platform team shipped a long-term memory feature six months ago that nobody added to the list.
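The gap described above can be sketched in a few lines. This is a minimal illustration, not the saga from the post: the store names, the `fake_erase` stub, and the idea of a separately discovered list of live stores are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    system: str
    tenant_id: str
    erased: bool


def run_deletion_saga(tenant_id, inventory, erase):
    """Call erase() for every system named in the inventory and collect receipts."""
    return [erase(system, tenant_id) for system in inventory]


def unlisted_stores(inventory, discovered):
    """Return stores that exist in production but are missing from the inventory."""
    return sorted(set(discovered) - set(inventory))


def fake_erase(system, tenant_id):
    # Hypothetical stand-in for a real erasure endpoint; always succeeds.
    return Receipt(system=system, tenant_id=tenant_id, erased=True)


# Hypothetical inventory, last updated before the memory feature shipped.
inventory = ["postgres-main", "s3-uploads", "search-index"]
# Hypothetical list of stores that actually exist today.
discovered = inventory + ["agent-long-term-memory"]

receipts = run_deletion_saga("tenant-42", inventory, fake_erase)
saga_reports_success = all(r.erased for r in receipts)
missing = unlisted_stores(inventory, discovered)

print(saga_reports_success)  # True: every listed system returned a receipt
print(missing)               # ['agent-long-term-memory']
```

The saga reports success and is still wrong, because success is defined relative to the list. Any check that catches the problem has to compare the list against something outside it.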

The Model Registry Your Platform Team Built That Nobody Updated

· 12 min read
Tian Pan
Software Engineer

A platform team I know spent two quarters building a model registry. It had everything the org chart asked for: a promotion workflow from dev to staging to prod, a CODEOWNERS-style approval matrix, lineage tracking, eval-score gates, a deprecation policy with a 30-day window, and a Backstage tile that showed which version of every model was live in which service. They cut a launch announcement, ran a brown bag, and added a row to the compliance binder.

Six months later, the highest-traffic agent in the company was running on a model card whose "owner" field still pointed at someone who had left, whose eval score was from a benchmark the team had since deprecated, and whose "approved by" name was the platform tech lead — who had never used that agent, never read its eval set, and had pressed approve at 11:43pm on a Thursday because the producer had pinged him in DMs saying the launch was tomorrow.

The registry was not broken. The promotion gates fired. The audit log was intact. Everything the launch announcement had promised was true. And the org had less real oversight of its production models than it had had eighteen months earlier, when the same decisions were made by an ML engineer reading the eval output by hand before pasting the model URI into a config file.

The Silent Personalization Layer Your Customers Could Not Reproduce

· 11 min read
Tian Pan
Software Engineer

A platform team ships a quality improvement. An inference-time layer reads the user's recent interactions and silently nudges the response style: more formal here, more terse there, more technical when the history suggests an engineer is asking. The A/B test shows an aggregate satisfaction lift of a couple of points. The launch post goes out under the heading "smarter responses, no API changes required." Nobody flips a flag in the API. Nobody updates the docs. Nothing in the response payload indicates which persona the model just adopted.

Six weeks later an enterprise customer files a support ticket that says, "your model is worse than you advertised." Their internal eval suite — running the same prompts your team published benchmarks against — scores eight points lower. Your team's first move is to verify prompt parity. Prompts match exactly. Decoding parameters match. The model version string matches. The divergence traces to the personalization layer, which infers a "thin-history default persona" for the customer's freshly-provisioned test account and a richer one for the long-lived user accounts your benchmarks were measured against. The conversation about whether the personalization is a feature or a bug stops being a product decision and becomes a contract negotiation.
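A small sketch shows why checking prompt parity cannot find this kind of divergence. The `infer_persona` rule, its threshold, and the field names are hypothetical, chosen only to show a hidden input that depends on account history.

```python
def infer_persona(history_len, min_history=20):
    """Hypothetical rule: accounts with little history get a default persona."""
    return "thin-history-default" if history_len < min_history else "history-derived"


def build_request(prompt, model_version, temperature, history_len):
    """Split a request into what the caller can inspect and what the layer adds silently."""
    visible = {
        "prompt": prompt,
        "model_version": model_version,
        "temperature": temperature,
    }
    hidden = {"persona": infer_persona(history_len)}
    return visible, hidden


# Same prompt, same model version, same decoding parameters.
customer_visible, customer_hidden = build_request(
    "Summarize this incident report.", "model-v3", 0.0, history_len=0
)
benchmark_visible, benchmark_hidden = build_request(
    "Summarize this incident report.", "model-v3", 0.0, history_len=500
)

print(customer_visible == benchmark_visible)  # True: parity checks pass
print(customer_hidden == benchmark_hidden)    # False: the personas differ
```

Everything the customer can compare matches. The only field that differs is one the response payload never exposes.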

The Context Window Is a Commons, and Every Team Is Grazing It

· 10 min read
Tian Pan
Software Engineer

Open a production agent and count what is in the context window before the user has typed a single character. There is a system prompt the platform team owns. There are tool definitions — forty of them, maybe more — each carrying a name, a description, a JSON schema, field-level docs, and a handful of enums. There is a block of retrieved examples that the search team added because few-shot helped one eval. There are six lines of safety instructions from trust and safety, four lines of formatting rules from the design team, and a paragraph of domain glossary that someone added during an incident and nobody removed.

Add it up and the agent boots with 30,000 tokens of overhead. On a connected setup with three MCP servers, that number is routinely far worse — one widely cited measurement put three servers at 143,000 of a 200,000-token budget, 72% of the window consumed before the conversation starts. None of it is wrong. Every line was added by someone solving a real problem. And that is exactly why the context window is being destroyed.
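The arithmetic is easy to make explicit. In the sketch below, the 30,000-token total and the 143,000 of 200,000 figure come from the text above; the per-owner breakdown is an invented placeholder that merely sums to 30,000, and the function names are assumptions.

```python
def overhead_share(overhead_tokens, window_tokens):
    """Fraction of the context window consumed before the user types anything."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    return overhead_tokens / window_tokens


def budget_report(contributions, window_tokens):
    """Return total overhead, remaining tokens, and owners sorted by largest share."""
    total = sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return total, window_tokens - total, ranked


WINDOW = 200_000

# Placeholder breakdown by owner (hypothetical numbers, sums to 30,000).
contributions = {
    "system_prompt": 2_000,
    "tool_definitions": 22_000,
    "retrieved_examples": 4_500,
    "safety_instructions": 300,
    "formatting_rules": 200,
    "domain_glossary": 1_000,
}

total, remaining, ranked = budget_report(contributions, WINDOW)
print(total, remaining)                          # 30000 170000
print(f"{overhead_share(total, WINDOW):.1%}")    # 15.0%
print(f"{overhead_share(143_000, WINDOW):.1%}")  # 71.5%, the cited three-server case
```

A per-owner ledger like this does not shrink the overhead by itself, but it turns "the context is full" into a list of named contributors and sizes.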

The AI Gateway Is the SPOF Nobody Named

· 10 min read
Tian Pan
Software Engineer

The pitch sounded responsible. "Let's not hardcode OpenAI everywhere — we'll put a thin abstraction in front, then we can swap providers if we need to." Two years later, that thin abstraction is a service with its own deploy pipeline, its own SRE on-call, an eval gate that blocks bad prompts, a semantic cache that saves seven figures a year, a retry policy with provider-specific backoffs, an observability schema every dashboard depends on, and a key vault holding the credentials for six model vendors. Every AI feature in the company terminates there.

It is also, almost by accident, the single point of failure with the worst blast radius in the stack. When the primary LLM provider goes down (and in 2025 OpenAI was tracked as having 294 outage events since January, while Anthropic logged 184.5 hours of total customer impact in December alone), the gateway routes around it and most users never notice. When the gateway itself dies, every AI feature in every product stops simultaneously, the failover that was supposed to fire never gets a chance, and the postmortem opens with "the abstraction layer we built to insulate us from provider outages was the outage."

The Internal-Tooling Agent: When Your Highest-Leverage AI Feature Has Zero Customers

· 10 min read
Tian Pan
Software Engineer

The most strategic AI investment in your company is probably a Slack bot one engineer built on a Friday afternoon. It answers "how do I get a staging credential" or "which on-call is responsible for the auth service" or "what's the runbook for a stuck deploy," and it has saved more engineering hours than the entire customer-facing AI roadmap that absorbs three-quarters of your model spend, your safety review queue, and your launch comms bandwidth.

The org chart doesn't reflect this. The OKR doc doesn't reflect this. Nobody is the PM. Nobody is the EM. The bot survives because the engineer who built it still answers the GitHub issues, and the value compounds quietly while every customer-facing feature ships behind a six-week safety review and a launch readiness checklist that exists because the customer might churn.

We Already Have That: When AI Features Reinvent Code You Already Own

· 11 min read
Tian Pan
Software Engineer

A team I worked with shipped a "smart" date extractor last quarter. The model parsed natural-language phrases like "next Tuesday" and "two weeks from the 14th," ran in production behind a feature flag, and cost about three cents per request at the chosen tier. Six weeks later, a backend engineer wandered into a design review and mentioned, casually, that the company already had a date parser. It had been written in 2019, lived in a utility module nobody on the AI team had read, handled 99.4% of the same inputs at sub-millisecond latency, and ran for free. The AI feature did not get pulled. It got rationalized — "the model handles the long tail" — and the team moved on, having shipped a more expensive, slower, less accurate version of something the company already owned.

This is not a one-off story. It is the dominant failure mode for AI features inside companies older than the AI team. The pattern repeats: a smart classifier duplicates a regex pipeline written years ago, a retrieval system fetches a vendor list that an internal service has been maintaining as a typed table, an agent learns to extract entities a parser already extracts deterministically. The AI feature ships with a quality bar lower than the deterministic system it didn't know existed, and the team who built the deterministic system finds out at a cross-team meeting.

The Agent Portfolio Audit: How to Consolidate 15 Independent Agents Into a Platform Without Killing Team Autonomy

· 9 min read
Tian Pan
Software Engineer

Six months after launching their first AI agent, most engineering organizations discover they have fifteen of them. Not because anyone planned a fleet — because each team solved a real problem and shipped. The customer support team built a triage agent. The data team built a report-generation agent. Platform engineering built a runbook agent. Infrastructure built three more. None of them share auth, logging, tooling, or evaluation methodology. Tokens are bleeding from a dozen provider accounts and nobody can tell you which agent is responsible.

This is the moment that separates engineering organizations that can scale AI from those that can't. The answer is not to slow down agent development — it's to run a portfolio audit before entropy makes consolidation impossible.

Golden Paths for AI Agents: How Platform Teams Can Enable Adoption Without Becoming a Bottleneck

· 11 min read
Tian Pan
Software Engineer

The most common failure mode for AI platform teams isn't technical. It's organizational: the central platform team becomes a gate that every product team must pass through to get any AI capability into production. The request queue grows. Cycle times balloon from days to weeks. Product teams get frustrated and start stitching together unofficial workarounds: hardcoded API keys, shadow LLM integrations, vendor accounts on personal credit cards. By the time the platform team notices, half the organization is running AI outside any governance structure.

The problem isn't that platform teams care about governance. It's that they implemented governance as an approval workflow instead of as infrastructure.

AI Ops Is Not Platform Engineering: How Running LLM Services Breaks Your SRE Playbook

· 10 min read
Tian Pan
Software Engineer

Your SRE team is excellent at running microservices. They've mastered blue-green deployments, canary rollouts, distributed tracing, SLO burn-rate alerts, and postmortem culture. Then someone ships an LLM-powered feature, and within a week an incident happens that none of those practices were designed to handle: the model starts generating plausible-sounding but structurally wrong outputs, no error is logged, no health check fails, and users silently get garbage for four hours before anyone notices.

This isn't a skills gap. It's an architectural gap. Running LLM services is a distinct operational discipline from running microservices, and the practices that don't transfer will burn your team if you don't identify them explicitly.

The Shadow AI Problem: Why Engineers Bypass Your Official AI Platform and What to Do About It

· 9 min read
Tian Pan
Software Engineer

Your data governance audit probably found them: API keys for OpenAI and Anthropic billed to personal credit cards, Slack bots wired to Claude through personal accounts, local Ollama instances proxying requests through the corporate VPN. Nobody told platform engineering. Nobody asked IT. The engineers just... did it.

This is the shadow AI problem, and it is already inside your organization whether you have detected it yet or not. Roughly half of employees in knowledge-work environments report using AI tools that their employers have not sanctioned. Among software engineers — who have the technical skill to set up unofficial integrations and the productivity pressure to want them — that number is almost certainly higher.

The instinct of most security and platform teams is to respond with prohibition: block the endpoints, restrict the API keys, add AI tool requests to the procurement queue. That response reliably produces more shadow AI, not less, because it treats a platform design failure as a compliance failure.

Your CS Team Built a Shadow Agent. That's Your Roadmap.

· 9 min read
Tian Pan
Software Engineer

A senior CSM in your support org spent a weekend wiring up an internal Slack bot. They wrote the system prompt themselves. They pointed it at the public docs, a Zendesk export of resolved tickets, and the changelog. Six weeks later it answers about 40% of the tier-1 questions their team used to type out by hand. Nobody on your engineering org chart knows it exists. The first time the platform team finds out, somebody from security will be asking why a service account is hitting Zendesk's API at 3am.

The default reaction is panic. Lock down the API token. Send a company-wide email about unsanctioned AI. Add a slide to the next governance review. Then promise that the platform team will build "the official version" next quarter, on the proper roadmap.

That reaction misses what actually happened. The CS team didn't go rogue — they built a working prototype of a product the engineering team hasn't shipped. They have real usage data, real prompt iteration cycles, and real user feedback. Your platform roadmap has none of those. Treating the bot as a compliance violation throws away the most accurate prioritization signal your AI program is going to get this year.