24 posts tagged with "devops"

The AI Changelog Problem: Why Your Prompt Updates Are Breaking Other Teams

· 11 min read
Tian Pan
Software Engineer

A platform team ships a one-line tweak to the system prompt of their summarization service. No code review, no migration guide, no version bump — it's "just a prompt." Two weeks later, the legal product team finds out their compliance auto-redaction has been silently letting names through. The investigation eats a sprint. The fix is trivial. The damage is the trust.

This is the AI changelog problem in miniature. Behavior is now a first-class output of your system, and behavior changes when prompts, models, retrievers, or tool schemas change. None of those changes show up in a git diff of the consuming application. Teams that treat AI updates like backend deploys, where a Slack message in #releases is enough, end up reinventing the worst parts of the early-2010s "we'll just push and tell QA later" workflow.

Model Migration as Database Migration: Safely Switching LLM Providers Without Breaking Production

· 10 min read
Tian Pan
Software Engineer

When your team decides to upgrade from Claude 3.5 Sonnet to Claude 3.7, or migrate from OpenAI to a self-hosted Llama deployment, the instinct is to treat it like a library upgrade: change the API key, update the model name string, run a quick sanity check, and ship. This instinct is wrong, and the teams that follow it discover why at 2 AM in week two when a customer support agent starts producing responses in a completely different format — technically valid, semantically disastrous.

Switching LLM providers or model versions is structurally identical to a database schema migration. Both involve changing the behavior of a system that the rest of your application has implicit contracts with. Both can look fine on day one and fail catastrophically on day ten. Both require dual-running, canary deployment, rollback criteria, and a migration playbook — not a config change followed by a Slack message.
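As a minimal sketch of the canary and rollback-criteria idea, assuming a hypothetical error-rate metric and an illustrative 2-point regression budget (neither comes from the post), traffic can be split deterministically by hashing a user ID:

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary cohort.

    Hashing keeps each user on the same model across requests,
    so behavior does not flip back and forth mid-session.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def should_roll_back(
    baseline_error_rate: float,
    canary_error_rate: float,
    max_regression: float = 0.02,  # illustrative budget, an assumption
) -> bool:
    """Roll back when the canary regresses past the agreed budget."""
    return canary_error_rate - baseline_error_rate > max_regression
```

The point of writing the rollback criterion down as code is that it gets agreed on before the migration starts, not argued about during an incident.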

Prompt Versioning Done Right: Treating LLM Instructions as Production Software

· 8 min read
Tian Pan
Software Engineer

Three words. That's all it took.

A team added three words to an existing prompt to improve "conversational flow" — a tweak that seemed harmless in the playground. Within hours, structured-output error rates spiked, a revenue-generating workflow stopped functioning, and engineers were scrambling to reconstruct what the prompt had said before the change. No version history. No rollback. Just a Slack message from someone who remembered it "roughly" and a diff against an obsolete copy in a Google Doc.

This is not a hypothetical. It is a pattern repeated across nearly every organization that ships LLM features at scale. Prompts start as strings in application code, evolve through informal edits, accumulate undocumented micro-adjustments, and eventually reach a state where nobody is confident about what's running in production or why it behaves the way it does.

The fix is not a new tool. It's discipline applied to something teams have been treating as config.
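One way to picture that discipline is a content-addressed prompt history with an explicit rollback. This is only a sketch: `PromptHistory` and `prompt_version` are hypothetical names, and a real setup would more likely keep prompts in version control or a database than in memory.

```python
import hashlib
from dataclasses import dataclass, field


def prompt_version(text: str) -> str:
    """Derive a short, stable version ID from the prompt text itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass
class PromptHistory:
    # Each entry is (version_id, prompt_text), oldest first.
    versions: list[tuple[str, str]] = field(default_factory=list)

    def publish(self, text: str) -> str:
        """Record a new prompt and return its version ID."""
        version = prompt_version(text)
        self.versions.append((version, text))
        return version

    def current(self) -> str:
        """Return the prompt text that is live right now."""
        if not self.versions:
            raise RuntimeError("no prompt has been published")
        return self.versions[-1][1]

    def rollback(self) -> str:
        """Drop the latest version and return the one before it."""
        if len(self.versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self.versions.pop()
        return self.current()
```

Because the ID is a hash of the text, even a three-word edit produces a new version, and the previous text is always recoverable exactly rather than "roughly."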

AI Agents in Your CI Pipeline: How to Gate Deployments That Can't Be Unit Tested

· 10 min read
Tian Pan
Software Engineer

Shipping a feature that calls an LLM is easy. Knowing whether the next version of that feature is better or worse than the one in production is hard. Traditional CI/CD gives you a pass/fail signal on deterministic behavior: either the function returns the right value or it doesn't. But when the function wraps a language model, the output is probabilistic — the same input produces different outputs across runs, across model versions, and across days.

Most teams respond to this by skipping the problem. They run their unit tests, do a quick manual check on a few prompts, and ship. That works until it doesn't — until a model provider silently updates the underlying weights, or a prompt change that looked fine in isolation shifts the output distribution in ways that only become obvious in production at 3 AM.

The better answer isn't to pretend LLM outputs are deterministic. It's to build CI gates that operate on distributions, thresholds, and rubrics rather than exact matches.
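A gate of that kind can be sketched as a pass rate over sampled outputs. The names `rubric_gate` and `is_json_object` and the 0.9 default threshold are assumptions for illustration, not something the post prescribes:

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class GateResult:
    pass_rate: float
    passed: bool


def rubric_gate(
    outputs: Sequence[str],
    rubric: Callable[[str], bool],
    min_pass_rate: float = 0.9,  # illustrative threshold, an assumption
) -> GateResult:
    """Pass the gate when enough sampled outputs satisfy the rubric."""
    if not outputs:
        raise ValueError("need at least one sampled output")
    hits = sum(1 for output in outputs if rubric(output))
    rate = hits / len(outputs)
    return GateResult(pass_rate=rate, passed=rate >= min_pass_rate)


def is_json_object(output: str) -> bool:
    """Example rubric: the output must parse as a JSON object."""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False
```

In CI, `outputs` would come from running the candidate prompt or model several times over a fixed evaluation set. The build fails on the aggregate rate, not on any single sample.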

The LLM Provider Incident Runbook: Staying Up When Your AI Stack Goes Down

· 11 min read
Tian Pan
Software Engineer

In December 2024, OpenAI's entire platform went dark for over four hours. A new telemetry service had been deployed with a configuration that caused every node in a massive fleet to simultaneously hammer the Kubernetes API. DNS broke. The control plane buckled. Every service went with it. Recovery took so long partly because the team lacked what they later called "break-glass tooling" — pre-built emergency mechanisms they could reach for when normal procedures stopped working.

If you were running an AI-powered product that day, you were making decisions fast under pressure. Multi-provider routing? Graceful degradation? Cached responses? Or just a status page and a prayer?

This is the runbook you should have written before that call came in.
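As a rough sketch of multi-provider routing with a cached fallback, assume each provider is wrapped in a plain callable that raises on failure. The function name and the cache shape are hypothetical:

```python
from typing import Callable, Optional, Sequence


def complete_with_fallback(
    prompt: str,
    providers: Sequence[Callable[[str], str]],
    cache: Optional[dict] = None,
) -> str:
    """Try providers in priority order, then fall back to a cached answer."""
    for call in providers:
        try:
            return call(prompt)
        except Exception:
            # Sketch only: production code would catch the provider's
            # specific timeout and availability errors, and log each one.
            continue
    if cache is not None and prompt in cache:
        return cache[prompt]
    raise RuntimeError("all providers failed and no cached response exists")
```

The ordering of `providers` is itself a runbook decision: it encodes which degradation is acceptable first.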

The Provider Abstraction Tax: Building LLM Applications That Can Swap Models Without Rewrites

· 10 min read
Tian Pan
Software Engineer

A healthcare startup migrated from one major frontier model to a newer version of the same provider's offering. The result: 400+ engineering hours to restore feature parity. The new model emitted five times as many tokens per response, eliminating projected cost savings. It started offering unsolicited diagnostic opinions—a liability problem. And it broke every JSON parser downstream because it wrapped responses in markdown code fences. Same provider, different model, total rewrite.

This is the provider abstraction tax: not the cost of switching providers, but the cumulative cost of not planning for it. It is not a single migration event. It is an ongoing drain—the behavioral regressions you discover three weeks after an upgrade, the prompt engineering work that does not transfer across models, the retry logic that silently fails because one provider measures rate limits by input tokens separately from output tokens. Teams that build directly on a single provider accumulate this debt invisibly, until a deprecation notice or a pricing change makes the bill come due all at once.
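The code-fence failure mentioned above is the kind of thing a thin parsing layer can absorb. A minimal sketch, assuming the model either returns bare JSON or wraps the whole response in a single fenced block (`parse_model_json` is a hypothetical helper):

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built here to avoid a literal one

# Matches a response that is entirely one fenced block, with an
# optional language tag such as "json" after the opening fence.
_FENCED = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_model_json(text: str):
    """Parse JSON from a model response, tolerating a markdown fence."""
    match = _FENCED.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)
```

This does not remove the need for behavioral testing after a model change. It only keeps one known formatting difference from reaching every downstream parser.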

The Model EOL Clock: Treating Provider LLMs as External Dependencies

· 11 min read
Tian Pan
Software Engineer

In January 2026, OpenAI retired several GPT models from ChatGPT with two weeks' notice, only weeks after its CEO had publicly promised "plenty of notice" following an earlier backlash. For teams that had built workflows around those models, the announcement arrived like a pager alert on a Friday afternoon. The API remained unaffected that time. It won't always be.

Every model you're currently calling has a deprecation date. Some of those dates are already listed on your provider's documentation page. Others haven't been announced yet. The operational question isn't whether your production model will be retired — it's whether you'll find out in time to handle it gracefully, or scramble to migrate after users start seeing failures.
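Finding out in time can be as simple as a scheduled check against the retirement dates you already know. In this sketch the model names, the dates, and the 90-day lead time are all placeholders. The real dates would come from your provider's deprecation page.

```python
from datetime import date


def models_near_eol(
    retirement_dates: dict[str, date],
    today: date,
    warn_days: int = 90,  # illustrative lead time, an assumption
) -> list[str]:
    """Return models already retired or retiring within warn_days."""
    return sorted(
        name
        for name, retires_on in retirement_dates.items()
        if (retires_on - today).days <= warn_days
    )
```

Run daily in CI or a cron job, a non-empty result becomes a ticket instead of a surprise.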

Prompt Linting: The Pre-Deployment Gate Your AI System Is Missing

· 8 min read
Tian Pan
Software Engineer

Every serious engineering team runs a linter before merging code. ESLint catches undefined variables. Prettier enforces formatting. Semgrep flags security anti-patterns. Nobody ships JavaScript to production without running at least one static check first.

Now consider what your team does before shipping a prompt change. If you're like most teams, the answer is: review it in a PR, eyeball it, maybe test it manually against a few inputs. Then merge. The system prompt for your production AI feature — the instruction set that controls how the model behaves for every single user — gets less pre-deployment scrutiny than a CSS change.

This gap is not a minor process oversight. A study analyzing over 2,000 developer prompts found that more than 10% contained vulnerabilities to prompt injection attacks, and roughly 4% had measurable bias issues — all without anyone noticing before deployment. The tooling to catch these automatically exists. Most teams just haven't wired it in yet.
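To make the idea concrete, here is a deliberately small sketch of a static prompt check. The three rules, the `{{name}}` placeholder syntax, and the 8,000-character limit are illustrative assumptions. This is not the tooling the study refers to, and it does not detect injection or bias.

```python
import re

# Assumed template syntax: {{name}} with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def lint_prompt(
    template: str,
    provided: set[str],
    max_chars: int = 8000,  # illustrative limit, an assumption
) -> list[str]:
    """Return human-readable findings; an empty list means the prompt passed."""
    findings = []
    if not template.strip():
        findings.append("prompt is empty")
    # Placeholders the caller never supplies would reach the model verbatim.
    missing = sorted(set(PLACEHOLDER.findall(template)) - provided)
    for name in missing:
        findings.append(f"placeholder '{name}' has no value")
    if len(template) > max_chars:
        findings.append(f"prompt exceeds {max_chars} characters")
    return findings
```

Wired into a pre-merge check, a non-empty findings list blocks the change the same way a failing ESLint run would.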

Agent Credential Rotation: The DevOps Problem Nobody Mapped to AI

· 8 min read
Tian Pan
Software Engineer

Every DevOps team has a credential rotation policy. Most have automated it for their services, CI pipelines, and databases. But the moment you deploy an autonomous AI agent that holds API keys across five different integrations, that rotation policy becomes a landmine. The agent is mid-task — triaging a bug, updating a ticket, sending a Slack notification — and suddenly its GitHub token expires. The process looks healthy. The logs show no crash. But silently, nothing works anymore.

This is the credential rotation problem that nobody mapped from DevOps to AI. Traditional rotation assumes predictable, human-managed workloads with clear boundaries. Autonomous agents shatter every one of those assumptions.
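One mitigation, sketched here under assumptions, is for the agent to check expiry before every tool call and refresh ahead of time. `Token`, `RefreshingCredential`, and the 60-second margin are hypothetical, and `fetch` stands in for whatever your secrets manager exposes.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Token:
    value: str
    expires_at: float  # Unix timestamp in seconds


class RefreshingCredential:
    """Hands out a token, refreshing it shortly before it expires."""

    def __init__(
        self,
        fetch: Callable[[], Token],
        margin_seconds: float = 60.0,  # illustrative safety margin
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[Token] = None

    def get(self) -> str:
        """Return a token value that is valid for at least the margin."""
        remaining = (
            self._token.expires_at - self._clock() if self._token else 0.0
        )
        if self._token is None or remaining <= self._margin:
            self._token = self._fetch()
        return self._token.value
```

Calling `get()` at each step of a multi-step task, rather than once at startup, is what keeps a mid-task rotation from turning into a silent failure.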

Simulation Environments for Agent Testing: Building Sandboxes Where Consequences Are Free

· 10 min read
Tian Pan
Software Engineer

Your agent passes every test in staging. Then it hits production and sends 4,000 emails, charges a customer twice, and deletes a record it wasn't supposed to touch. The staging tests weren't wrong — they just tested the wrong things. The staging environment made the agent look safe because everything it could break was fake in the wrong way: mocked just enough to not crash, but realistic enough to fool you into thinking the test meant something.

This is the simulation fidelity trap. It's different from ordinary software testing failures. For a deterministic function, a staging environment that mirrors production schemas and APIs is usually sufficient. For an agent, behavior emerges from the interaction between reasoning, tool outputs, and accumulated state across a multi-step trajectory. A staging environment that diverges from production in any of those dimensions will leave you systematically overconfident about how the agent will behave under real conditions.

CI/CD for LLM Applications: Why Deploying a Prompt Is Nothing Like Deploying Code

· 10 min read
Tian Pan
Software Engineer

Your code ships through a pipeline: feature branch → pull request → automated tests → staging → production. Every step is gated. Nothing reaches users without passing the checks you've defined. It's boring in the best way.

Now imagine you need to update a system prompt. You edit the string in your dashboard, hit save, and the change is live immediately: no tests, no staging, no diff in version control, no way to roll back except by editing it back by hand. This is how most teams operate, and it's why prompt changes so often cause unexpected production outages in LLM applications.

The challenge isn't that teams are careless. It's that the discipline of continuous delivery was built for deterministic systems, and LLMs aren't deterministic. The entire mental model needs to be rebuilt from scratch.

Prompt Versioning and Change Management in Production AI Systems

· 9 min read
Tian Pan
Software Engineer

A team added three words to a customer service prompt to make it "more conversational." Within hours, structured-output error rates spiked and a revenue-generating pipeline stalled. Engineers spent most of a day debugging infrastructure and code before anyone thought to look at the prompt. There was no version history. There was no rollback. The three-word change had been made inline, in a config file, by a product manager who had no reason to think it was risky.

This is the canonical production prompt incident. Variations of it play out at companies of every size, and the root cause is almost always the same: prompts were treated as ephemeral configuration instead of software.