118 posts tagged with "llm-ops"

The Model-of-the-Week Roadmap: When Vendor Promises Become Committed Dependencies

· 9 min read
Tian Pan
Software Engineer

A product manager pulls up the next-quarter roadmap. Three features are marked "depends on next-gen model." Nobody asks what happens if next-gen slips, arrives 20% smaller than the demo suggested, or ships gated behind an enterprise tier your customers do not qualify for. Six months later, all three of those scenarios have happened, and the team is now rebuilding two quarters of architecture against the model that actually shipped — a different shape from the one they planned for.

This is the model-of-the-week roadmap: treating unreleased capability claims as committed dependencies. It is one of the most reliable ways to turn a twelve-month plan into a thirty-month plan, and it rarely looks risky in the moment because every vendor demo feels inevitable. The schedule damage is invisible until the slip compounds.

The Prompt Ownership Problem: When Conway's Law Comes for Your Prompts

· 11 min read
Tian Pan
Software Engineer

Every non-trivial AI product eventually develops a prompt that nobody is allowed to touch. It has three conditional branches, two inline examples pasted in during a customer-reported incident, and a sentence that begins with "IMPORTANT:" followed by a tone instruction nobody remembers writing. The prompt is 1,400 tokens. The PR that last modified it was reviewed by an engineer who has since changed teams. When a new model comes out, nobody is confident the prompt will still work. When evals regress, nobody is sure whether the prompt, the model, the retrieval pipeline, or a downstream tool caused it. The string is shared across four services. Every team has a local override. None of the overrides are documented.

This is the prompt ownership problem, and it is the single most under-discussed failure mode in multi-team AI engineering. It is not a technical problem. It is Conway's law reasserting itself at the token level. An organization's prompts end up mirroring its org chart, its RACI gaps, and its coordination tax — and the model, which does not care about your Jira hierarchy, produces correspondingly incoherent behavior for end users who do not care either.

Semantic Diff for Prompts: Why Git Diff Lies About What Your Prompt Change Will Do

· 10 min read
Tian Pan
Software Engineer

A teammate opens a pull request that rewrites your agent's system prompt from 420 lines to 380. The diff is green-and-red carnage: deleted paragraphs, moved sections, tightened language. You approve it because the cleanup looks sensible. A week later, refund-request accuracy has dropped eight points and nobody can say which line did it.

A different teammate adds the word "concise" to one instruction. One word of diff. Nobody reviews it closely because there is almost nothing to review. That edit flips tool-call behavior on 22% of queries.
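One way to make the contrast concrete is to compare behavior rather than text. The sketch below is illustrative only: `run_old` and `run_new` are hypothetical stand-ins for "call the model with the old or new prompt", and it assumes each output has already been reduced to a comparable decision such as the chosen tool name. Real model output would need a softer comparison than string equality.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple


def text_similarity(old_prompt: str, new_prompt: str) -> float:
    """Character-level similarity in [0, 1]; roughly what a text diff 'sees'."""
    return SequenceMatcher(None, old_prompt, new_prompt).ratio()


def behavior_diff(
    run_old: Callable[[str], str],
    run_new: Callable[[str], str],
    probes: Sequence[str],
) -> Tuple[float, List[str]]:
    """Return the fraction of probes whose output changed, plus those probes."""
    if not probes:
        raise ValueError("need at least one probe input")
    changed = [p for p in probes if run_old(p) != run_new(p)]
    return len(changed) / len(probes), changed


# Hypothetical stubs standing in for real model calls (assumption, not an API).
def run_old(query: str) -> str:
    return "call_tool" if "order" in query else "answer"


def run_new(query: str) -> str:
    # Pretend the added word "concise" suppresses some tool calls.
    return "call_tool" if "order status" in query else "answer"


OLD_PROMPT = "Answer the user's question. Use tools when needed."
NEW_PROMPT = "Answer the user's question. Be concise. Use tools when needed."
PROBES = ["where is my order", "order status please", "what are your hours"]

similarity = text_similarity(OLD_PROMPT, NEW_PROMPT)
change_rate, changed_probes = behavior_diff(run_old, run_new, PROBES)
print(f"text similarity: {similarity:.2f}, behavior change rate: {change_rate:.2f}")
```

The two numbers answer different questions: the first says how much the string moved, the second how much the system's decisions moved on a fixed probe set.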

The Ship-and-Pin Trap: How Model Version Stability Becomes Deprecation Debt

· 9 min read
Tian Pan
Software Engineer

Pinning a model version in production feels like engineering discipline. You lock claude-opus-4-0 or gpt-4o-2024-08-06 into config, write a note in the README about why, and move on to shipping features. The output distribution stops shifting under you, the evals stay green, and the prompt tuning you did last quarter keeps working. What you've actually done is start a silent timer. Twelve to fifteen months later the deprecation email arrives, and three sprints of undocumented behavioral dependencies — prompt tuning, eval calibration, output shape assumptions, temperature quirks — all come due at once.

This is the ship-and-pin trap. Pinning is correct in the short term and catastrophic in the long term, because the cost of stability compounds in places you aren't looking. The prompt that was "good enough" a year ago is now load-bearing in ways nobody documented. The JSON schema your downstream service expects was shaped to one model's tokenization habits. The few-shot examples you hand-tuned were tuned against a specific model's notion of helpfulness. When the provider retires the version string, none of these dependencies migrate automatically, and the work to re-qualify them always lands under deadline pressure.

Token Spend Is a Security Signal Your SOC Isn't Watching

· 11 min read
Tian Pan
Software Engineer

The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.

That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.
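A minimal sketch of treating token usage as a detection signal, assuming you can already pull hourly token counts per API key from billing or gateway logs. The function name, the median-plus-MAD baseline, and the thresholds are illustrative assumptions, not a method the incidents above prescribe.

```python
from statistics import median
from typing import Sequence


def is_spend_anomaly(
    history: Sequence[float],
    current: float,
    k: float = 6.0,
    min_spread: float = 1.0,
) -> bool:
    """Flag `current` if it sits far above the key's own recent baseline.

    history    -- past hourly token counts for ONE API key
    current    -- the latest hourly token count for that key
    k          -- how many spreads above the median counts as anomalous (assumed)
    min_spread -- floor so a perfectly flat history does not flag tiny changes
    """
    if len(history) < 3:
        return False  # too little data to form a baseline
    baseline = median(history)
    # Median absolute deviation: a spread estimate that ignores old spikes.
    mad = median([abs(x - baseline) for x in history])
    spread = max(mad, min_spread)
    return current > baseline + k * spread


hourly_tokens = [1000, 1100, 950, 1050, 1020]
print(is_spend_anomaly(hourly_tokens, 1150))    # within normal variation
print(is_spend_anomaly(hourly_tokens, 50_000))  # stolen-key style spike
```

The median-based baseline is a deliberate choice here: a mean and standard deviation would be dragged upward by a previous incident, which makes the next one harder to catch.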

Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

· 11 min read
Tian Pan
Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.
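As a rough sketch, TTFT can be captured by wrapping whatever iterable your client yields tokens from. `StreamTiming` and `timed_stream` are hypothetical names, and the injectable `clock` exists only to make the wrapper testable; tagging the result with model, route, and tenant is left to your metrics layer.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    """Timing for one streamed response; all values are in seconds."""
    ttft: Optional[float] = None   # time to first token; None if nothing arrived
    total: Optional[float] = None  # time until the stream was exhausted
    tokens: int = 0


def timed_stream(
    stream: Iterable[str],
    timing: StreamTiming,
    clock: Callable[[], float] = time.perf_counter,
) -> Iterator[str]:
    """Yield tokens unchanged while recording TTFT and total latency."""
    # Start the clock when the wrapper is created, i.e. when the request is
    # issued, not when the consumer first pulls from the stream.
    start = clock()

    def _generate() -> Iterator[str]:
        for token in stream:
            if timing.ttft is None:
                timing.ttft = clock() - start
            timing.tokens += 1
            yield token
        timing.total = clock() - start

    return _generate()


# Usage with a stand-in stream; a real client's streaming iterator goes here.
timing = StreamTiming()
text = "".join(timed_stream(iter(["Hel", "lo", "!"]), timing))
print(text, timing.tokens, timing.ttft is not None, timing.total is not None)
```

Recording `ttft` and `total` separately is the point: a reasoning model and its non-reasoning sibling can report the same `total` while `ttft` differs by orders of magnitude.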

The AI Changelog Problem: Why Your Prompt Updates Are Breaking Other Teams

· 11 min read
Tian Pan
Software Engineer

A platform team ships a one-line tweak to the system prompt of their summarization service. No code review, no migration guide, no version bump — it's "just a prompt." Two weeks later, the legal product team finds out their compliance auto-redaction has been silently letting names through. The investigation eats a sprint. The fix is trivial. The damage is the trust.

This is the AI changelog problem in miniature. Behavior is now a first-class output of your system, and behavior changes when prompts, models, retrievers, or tool schemas change — none of which show up in git diff of the consuming application. Teams that treat AI updates like backend deploys, where a Slack message in #releases is enough, end up reinventing the worst parts of the early-2010s "we'll just push and tell QA later" workflow.

Why Your LLM Alerting Is Always Two Weeks Late

· 10 min read
Tian Pan
Software Engineer

Most teams discover their LLM has been degrading for two weeks by reading a Slack message that starts with "hey, has anyone noticed the AI outputs seem off lately?" By that point the damage is done: users have already formed opinions, support tickets have accumulated, and the business stakeholder who championed the feature is quietly losing confidence in it.

The frustrating part is that your infrastructure was healthy the entire time. HTTP 200s, 180 ms p50 latency, $0.04 per request: everything green on the dashboard. The model just got quieter, vaguer, shorter, and more hesitant in ways that infrastructure monitoring cannot see.

This is not a monitoring gap you can close with more Datadog dashboards. It requires a different class of metrics entirely.
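One possible shape for such metrics, sketched under loud assumptions: the hedge-phrase list, the two statistics, and the thresholds below are all illustrative placeholders, and a real system would likely use richer signals. The idea is only to compare the shape of recent outputs against a known-good baseline window.

```python
from typing import Dict, Sequence

# Illustrative hedge phrases (assumption); tune these for your own product.
HEDGES = ("i'm not sure", "it depends", "i cannot", "as an ai")


def output_shape(responses: Sequence[str]) -> Dict[str, float]:
    """Summarize a window of responses by length and hedging rate."""
    if not responses:
        raise ValueError("need at least one response")
    n = len(responses)
    mean_words = sum(len(r.split()) for r in responses) / n
    hedged = sum(1 for r in responses if any(h in r.lower() for h in HEDGES))
    return {"mean_words": mean_words, "hedge_rate": hedged / n}


def has_drifted(
    baseline: Dict[str, float],
    current: Dict[str, float],
    max_length_drop: float = 0.3,  # alert if answers shrink by more than 30%
    max_hedge_rise: float = 0.1,   # alert if hedging rises by more than 10 points
) -> bool:
    """Return True when outputs got notably shorter or more hesitant."""
    if baseline["mean_words"] <= 0:
        raise ValueError("baseline mean_words must be positive")
    length_drop = 1 - current["mean_words"] / baseline["mean_words"]
    hedge_rise = current["hedge_rate"] - baseline["hedge_rate"]
    return length_drop > max_length_drop or hedge_rise > max_hedge_rise


last_month = output_shape(["one two three four five six", "one two three four"])
this_week = output_shape(["It depends", "one two three"])
print(has_drifted(last_month, this_week))
```

Neither statistic requires labels or a judge model, so a check like this can run on every production window from day one.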

Annotation-Free Evaluation: Measuring LLM Quality Before You Have Ground Truth

· 12 min read
Tian Pan
Software Engineer

Most teams ship an LLM feature, then spend weeks arguing about whether it's actually good. The evaluation question gets deferred because building a labeled dataset feels like a separate project. By the time you have ground truth, you've also accumulated two months of silent regressions you can never diagnose. This is backwards. You can get a meaningful quality signal in week one — before a single annotation is complete — if you know which techniques to reach for and where each one breaks.

This post is a field guide to annotation-free evaluation: the reference-free methods that work, the conditions they require, and the specific failure modes that will fool you if you're not careful.

AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

· 11 min read
Tian Pan
Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

Contract Tests for Prompts: Stop One Team's Edit From Breaking Another Team's Agent

· 9 min read
Tian Pan
Software Engineer

A platform team rewords the intent classifier prompt to "better handle compound questions." One sentence changes. Their own eval suite goes green — compound-question accuracy improves 6 points. They merge at 3pm. By 5pm, three downstream agent teams are paging: the routing agent is sending refund requests to the shipping queue, the summarizer agent is truncating at a different boundary, and the ticket-tagger has started emitting a category that no schema recognizes. None of those downstream teams were in the review. Nobody was on call for "the intent prompt."

This is not a hypothetical. It is what happens when a prompt becomes a shared dependency without becoming a shared API. A prompt change that improves one team's metric can silently invalidate the assumptions another team built on top. And unlike a breaking API change, there is no deserialization error, no schema mismatch, no 500 — the downstream just starts making subtly worse decisions.

Traditional API engineering solved this decades ago with contract tests. The consumer publishes the shape of what it expects; the provider is obligated to keep that shape working. Pact, consumer-driven contracts, shared schemas — this is release-engineering orthodoxy for HTTP services. Prompts deserve the same discipline, and most organizations still treat them like sticky notes passed between teams.
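To illustrate what the prompt equivalent might look like, here is a minimal consumer-driven sketch. It does not use Pact; `PromptContract`, `check_contract`, and the stub classifiers are hypothetical names introduced for illustration. Each consumer publishes the labels it can handle plus a few pinned cases, and the provider runs every contract before merging a prompt change.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


@dataclass(frozen=True)
class PromptContract:
    """What one downstream consumer expects from the shared classifier."""
    consumer: str
    allowed_labels: FrozenSet[str]
    cases: Tuple[Tuple[str, str], ...]  # (input text, expected label) pairs


def check_contract(
    classify: Callable[[str], str],
    contract: PromptContract,
) -> List[str]:
    """Run the consumer's cases through `classify`; return violations found."""
    violations = []
    for text, expected in contract.cases:
        label = classify(text)
        if label not in contract.allowed_labels:
            violations.append(
                f"{contract.consumer}: unknown label {label!r} for {text!r}"
            )
        elif label != expected:
            violations.append(
                f"{contract.consumer}: expected {expected!r}, got {label!r} for {text!r}"
            )
    return violations


routing_contract = PromptContract(
    consumer="routing-agent",
    allowed_labels=frozenset({"refund", "shipping"}),
    cases=(("I want my money back", "refund"), ("Where is my parcel", "shipping")),
)


# Hypothetical stand-in for "the classifier running the edited prompt".
def edited_classifier(text: str) -> str:
    return "billing_dispute" if "money" in text else "shipping"


for problem in check_contract(edited_classifier, routing_contract):
    print(problem)
```

Wired into CI on the provider's side, a non-empty violation list blocks the merge and names the consumer who needs to be in the review.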

The Shadow Prompt Library: Governance for an Asset Class Nobody Owns

· 12 min read
Tian Pan
Software Engineer

Walk into almost any engineering org with a live LLM feature and ask a simple question: who owns the prompts? You will get a pause, then a shrug, then an answer that dissolves on contact. "Product wrote the first one." "The PM tweaked it last sprint." "I think it lives in a Notion doc, or maybe that const SYSTEM_PROMPT in agent.ts." The prompt is running in production. It shapes what users see, what actions the agent takes, what numbers show up in next quarter's revenue chart. And it has less governance surface than the CSS file nobody admits to touching.

This is the shadow prompt library: the accumulated pile of strings — system prompts, few-shot exemplars, tool descriptions, routing rules, evaluator rubrics — that collectively define product behavior and that collectively have no code review, no deploy pipeline, no owner, no deprecation policy, and no audit trail. They are the most load-bearing artifact in your AI stack and the least supervised.

The consequences are no longer theoretical. Ninety-eight percent of organizations now report unsanctioned AI use, and nearly half expect a shadow-AI incident within twelve months. Regulators are catching up faster than governance is: the EU AI Act's high-risk provisions apply in August 2026, and Article 12 is explicit that logs tying outputs to prompts and model versions must be automatic, not aspirational. If your prompts are scattered across a dozen codebases and a Slack thread, you do not have an audit trail; you have a liability.