
89 posts tagged with "llm-ops"


Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

· 11 min read
Tian Pan
Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.
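
As a concrete starting point, here is a minimal sketch of that instrumentation, assuming an OpenAI-style streaming client and prometheus_client; the metric name, label set, and bucket boundaries are illustrative, not prescriptive:

```python
import time

from openai import OpenAI
from prometheus_client import Histogram

# Bucket boundaries span sub-second completions and multi-second
# reasoning-model stalls; tune them to your own traffic.
TTFT_SECONDS = Histogram(
    "llm_time_to_first_token_seconds",
    "Time from request start to first streamed content token",
    labelnames=["model", "route", "tenant"],
    buckets=[0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60],
)

client = OpenAI()

def stream_with_ttft(model: str, route: str, tenant: str, messages: list[dict]):
    """Yield tokens to the caller while recording TTFT exactly once."""
    start = time.monotonic()
    seen_first = False
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    for chunk in stream:
        if not chunk.choices:
            continue  # some providers emit metadata-only chunks
        delta = chunk.choices[0].delta.content
        if not delta:
            continue
        if not seen_first:
            # The number that governs perceived speed: the first content token.
            TTFT_SECONDS.labels(model=model, route=route, tenant=tenant).observe(
                time.monotonic() - start
            )
            seen_first = True
        yield delta
```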

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.

The AI Changelog Problem: Why Your Prompt Updates Are Breaking Other Teams

· 11 min read
Tian Pan
Software Engineer

A platform team ships a one-line tweak to the system prompt of their summarization service. No code review, no migration guide, no version bump — it's "just a prompt." Two weeks later, the legal product team finds out their compliance auto-redaction has been silently letting names through. The investigation eats a sprint. The fix is trivial. The damage is the trust.

This is the AI changelog problem in miniature. Behavior is now a first-class output of your system, and behavior changes when prompts, models, retrievers, or tool schemas change — none of which show up in git diff of the consuming application. Teams that treat AI updates like backend deploys, where a Slack message in #releases is enough, end up reinventing the worst parts of the early-2010s "we'll just push and tell QA later" workflow.
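
To make the point concrete, here is one hedged sketch of a countermeasure: fingerprint every behavior-relevant input so a "just a prompt" tweak produces a visible version bump. The function and field names are illustrative, not a prescribed scheme:

```python
import hashlib
import json

def behavior_fingerprint(
    prompt: str,
    model: str,
    tool_schemas: list[dict],
    retriever_config: dict,
) -> str:
    """Hash every input that can change behavior without changing consumer code.

    A bump in this fingerprint is the version number that prompt tweaks never
    get today: consumers pin it, CI diffs it, and any change forces a
    changelog entry before deploy.
    """
    payload = json.dumps(
        {
            "prompt": prompt,
            "model": model,
            "tools": tool_schemas,
            "retriever": retriever_config,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```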

Why Your LLM Alerting Is Always Two Weeks Late

· 10 min read
Tian Pan
Software Engineer

Most teams discover their LLM has been degrading for two weeks by reading a Slack message that starts with "hey, has anyone noticed the AI outputs seem off lately?" By that point the damage is done: users have already formed opinions, support tickets have accumulated, and the business stakeholder who championed the feature is quietly losing confidence in it.

The frustrating part is that your infrastructure was healthy the entire time. HTTP 200s, 180ms p50 latency, $0.04 per request—everything green on the dashboard. The model just got quieter, vaguer, shorter, and more hesitant in ways that infrastructure monitoring cannot see.

This is not a monitoring gap you can close with more Datadog dashboards. It requires a different class of metrics entirely.
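
As a taste of that different class, here is a minimal sketch of an output-drift monitor. The window sizes and the three-standard-error threshold are assumptions to tune, not recommendations from the post:

```python
from collections import deque
from statistics import mean, stdev

class OutputDriftMonitor:
    """Alert when recent responses drift from a trailing baseline.

    "Quieter, vaguer, shorter" shows up in output statistics long before it
    shows up in HTTP metrics. Length is the simplest such statistic; refusal
    rate and hedging-phrase rate follow the same pattern.
    """

    def __init__(self, baseline_size: int = 1000, recent_size: int = 100):
        self.baseline = deque(maxlen=baseline_size)
        self.recent = deque(maxlen=recent_size)

    def observe(self, response_text: str) -> bool:
        length = len(response_text.split())
        self.recent.append(length)
        alert = False
        if len(self.baseline) >= 200 and len(self.recent) == self.recent.maxlen:
            mu, sigma = mean(self.baseline), stdev(self.baseline)
            stderr = sigma / (len(self.recent) ** 0.5)
            # Recent mean sits more than ~3 standard errors from baseline mean.
            if stderr > 0 and abs(mean(self.recent) - mu) > 3 * stderr:
                alert = True
        self.baseline.append(length)
        return alert
```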

Annotation-Free Evaluation: Measuring LLM Quality Before You Have Ground Truth

· 12 min read
Tian Pan
Software Engineer

Most teams ship an LLM feature, then spend weeks arguing about whether it's actually good. The evaluation question gets deferred because building a labeled dataset feels like a separate project. By the time you have ground truth, you've also accumulated two months of silent regressions you can never diagnose. This is backwards. You can get a meaningful quality signal in week one — before a single annotation is complete — if you know which techniques to reach for and where each one breaks.

This post is a field guide to annotation-free evaluation: the reference-free methods that work, the conditions they require, and the specific failure modes that will fool you if you're not careful.
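
As a preview, here is a minimal sketch of the cheapest tier: deterministic, reference-free checks that need no labeled data. The function name and thresholds are illustrative, and each check has exactly the kind of failure mode the post catalogs:

```python
import json

def reference_free_checks(output: str, context: str) -> dict[str, float]:
    """Cheap quality signals that require zero annotations.

    Each is a weak signal on its own; in aggregate they catch week-one
    regressions that would otherwise wait months for a labeled dataset.
    """
    checks: dict[str, float] = {}

    # 1. Structural validity: does the output parse as the JSON we asked for?
    try:
        json.loads(output)
        checks["valid_json"] = 1.0
    except json.JSONDecodeError:
        checks["valid_json"] = 0.0

    # 2. Crude groundedness: fraction of output tokens that appear in the
    #    retrieved context. Fooled by paraphrase, a known failure mode.
    out_tokens = set(output.lower().split())
    ctx_tokens = set(context.lower().split())
    checks["token_overlap"] = len(out_tokens & ctx_tokens) / max(len(out_tokens), 1)

    # 3. Degenerate-output guard: empty or near-empty responses.
    checks["nonempty"] = float(len(output.strip()) > 20)

    return checks
```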

AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

· 11 min read
Tian Pan
Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

Contract Tests for Prompts: Stop One Team's Edit From Breaking Another Team's Agent

· 9 min read
Tian Pan
Software Engineer

A platform team rewords the intent classifier prompt to "better handle compound questions." One sentence changes. Their own eval suite goes green — compound-question accuracy improves 6 points. They merge at 3pm. By 5pm, three downstream agent teams are getting paged: the routing agent is sending refund requests to the shipping queue, the summarizer agent is truncating at a different boundary, and the ticket-tagger has started emitting a category that no schema recognizes. None of those downstream teams were in the review. Nobody was on call for "the intent prompt."

This is not a hypothetical. It is what happens when a prompt becomes a shared dependency without becoming a shared API. A prompt change that improves one team's metric can silently invalidate the assumptions another team built on top. And unlike a breaking API change, there is no deserialization error, no schema mismatch, no 500 — the downstream just starts making subtly worse decisions.

Traditional API engineering solved this decades ago with contract tests. The consumer publishes the shape of what it expects; the provider is obligated to keep that shape working. Pact, consumer-driven contracts, shared schemas — this is release-engineering orthodoxy for HTTP services. Prompts deserve the same discipline, and most organizations still treat them like sticky notes passed between teams.
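
Here is a minimal sketch of what that looks like for a prompt, in pytest style. classify_intent stands in for the provider's real prompt-backed entry point and is stubbed here for illustration:

```python
import json

import pytest

# Consumer-published contract: fixtures plus the invariants the routing
# agent depends on. It lives in the consumer's repo and runs in the
# provider's CI, which is the Pact pattern applied to a prompt.
ROUTING_CONTRACT = [
    {"input": "I want my money back for order 1182", "must_route_to": "refunds"},
    {"input": "Where is my package?", "must_route_to": "shipping"},
]
ALLOWED_CATEGORIES = {"refunds", "shipping", "billing", "other"}

def classify_intent(user_message: str) -> str:
    """Provider entry point: runs the intent prompt, returns JSON text.

    Stubbed here; the real implementation calls the model.
    """
    raise NotImplementedError

@pytest.mark.parametrize("case", ROUTING_CONTRACT)
def test_intent_prompt_honors_routing_contract(case):
    result = json.loads(classify_intent(case["input"]))
    assert result["category"] in ALLOWED_CATEGORIES      # no novel categories
    assert result["category"] == case["must_route_to"]   # routing stays stable
```

Because the output is stochastic, a real harness would run each fixture several times and require unanimity before the provider's change can merge.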

The Shadow Prompt Library: Governance for an Asset Class Nobody Owns

· 12 min read
Tian Pan
Software Engineer

Walk into almost any engineering org with a live LLM feature and ask a simple question: who owns the prompts? You will get a pause, then a shrug, then an answer that dissolves on contact. "Product wrote the first one." "The PM tweaked it last sprint." "I think it lives in a Notion doc, or maybe that const SYSTEM_PROMPT in agent.ts." The prompt is running in production. It shapes what users see, what actions the agent takes, what numbers show up in next quarter's revenue chart. And it has less governance surface than the CSS file nobody admits to touching.

This is the shadow prompt library: the accumulated pile of strings — system prompts, few-shot exemplars, tool descriptions, routing rules, evaluator rubrics — that collectively define product behavior and that collectively have no code review, no deploy pipeline, no owner, no deprecation policy, and no audit trail. They are the most load-bearing artifact in your AI stack and the least supervised.
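
What would even minimal supervision look like? One illustrative sketch of a registry record, with fields assumed rather than prescribed:

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class PromptRecord:
    """One entry in a prompt registry: the governance surface that is missing.

    The schema here is a sketch, not a standard.
    """
    name: str                       # e.g. "summarizer.system"
    text: str
    owner: str                      # a team, not a person who may leave
    version: int
    model_targets: tuple[str, ...]  # models this prompt was validated against
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def checksum(self) -> str:
        # Stable identifier for audit logs: ties an output to the exact
        # prompt text that produced it, which is what Article 12 expects.
        return hashlib.sha256(self.text.encode()).hexdigest()[:12]
```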

The consequences are no longer theoretical. Ninety-eight percent of organizations now report unsanctioned AI use, and nearly half expect a shadow-AI incident within twelve months. Regulators are catching up faster than governance is: the EU AI Act's high-risk provisions apply in August 2026, and Article 12 is explicit that logs tying outputs to prompts and model versions must be automatic, not aspirational. If your prompts are scattered across a dozen codebases and a Slack thread, you do not have an audit trail; you have a liability.

The Model Deprecation Cliff: What Happens When Your Provider Sunsets the Model Your Product Depends On

· 8 min read
Tian Pan
Software Engineer

Most teams discover they are model-dependent the same way you discover a load-bearing wall — by trying to remove it. The deprecation email arrives, you swap the model identifier in your config, and your application starts returning confident, well-formatted, subtly wrong answers. No errors. No crashes. Just a slow bleed of trust that takes weeks to notice and months to repair.

This is the model deprecation cliff: the moment when a forced migration reveals that your "model-agnostic" system was never agnostic at all. Your prompts, your output parsers, your evaluation baselines, your users' expectations — all of them were quietly calibrated to behavioral quirks that are about to change on someone else's release schedule.
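
A minimal sketch of the pre-cutover check that finds those quirks, assuming a complete(model, prompt) wrapper around your provider; the golden-set shape is illustrative:

```python
from typing import Callable

def migration_diff(
    golden_set: list[dict],
    old_model: str,
    new_model: str,
    complete: Callable[[str, str], str],
) -> list[dict]:
    """Surface behavioral regressions before cutover, not after.

    Each golden case carries its own checker: a schema parse, a regex,
    or an LLM judge.
    """
    regressions = []
    for case in golden_set:
        old_out = complete(old_model, case["prompt"])
        new_out = complete(new_model, case["prompt"])
        if case["check"](old_out) and not case["check"](new_out):
            # Passed on the sunset model, fails on its replacement:
            # exactly the subtly wrong answers the deprecation email
            # never warns about.
            regressions.append(
                {"prompt": case["prompt"], "old": old_out, "new": new_out}
            )
    return regressions
```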

AI Technical Debt: Four Categories That Never Show Up in Your Sprint Retro

· 11 min read
Tian Pan
Software Engineer

Your sprint retro covers the usual suspects: flaky tests, that migration someone keeps punting, the API endpoint held together with duct tape. But if you're shipping AI features, the most expensive debt in your codebase is the kind nobody puts on a sticky note.

Traditional technical debt accumulates linearly. You cut a corner, you pay interest on it later, you refactor when the pain gets bad enough. AI technical debt compounds. A prompt that degrades silently produces training signals that pollute your evals, which misguide your next round of prompt changes, which further erodes the quality your users experience. By the time someone notices, three layers of assumptions have rotted underneath you.

Deterministic Replay: How to Debug AI Agents That Never Run the Same Way Twice

· 11 min read
Tian Pan
Software Engineer

Your agent failed in production last Tuesday. A customer reported a wrong answer. You pull up the logs, see the final output, maybe a few intermediate print statements — and then you're stuck. You can't re-run the agent and get the same failure because the model won't produce the same tokens, the API your tool called now returns different data, and the timestamp embedded in the prompt has moved forward. The bug is gone, and you're left staring at circumstantial evidence.

This is the fundamental debugging problem for AI agents: traditional software is deterministic, so you can reproduce bugs by recreating inputs. Agent systems are not. Every run is a unique snowflake of model sampling, live API responses, and time-dependent state. Without specialized tooling, post-mortem debugging becomes forensic guesswork.

Deterministic replay solves this by recording every source of non-determinism during execution and substituting those recordings during replay — turning your unreproducible agent run into something you can step through like a debugger.
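
A minimal sketch of both halves, for model calls only; a real harness would also tape tool I/O and clock reads, and the class and field names here are assumptions:

```python
import hashlib
import json

class ReplayableLLM:
    """Record every model response live; substitute recordings on replay.

    The tape is keyed by a hash of the request, so a replayed run must issue
    the same calls in the same shape to get the same answers, which is
    exactly the property being debugged.
    """

    def __init__(self, client, mode: str = "record", tape: dict | None = None):
        self.client = client      # an OpenAI-style client, assumed
        self.mode = mode          # "record" or "replay"
        self.tape = tape or {}    # request-hash -> recorded response text

    def _key(self, model: str, messages: list[dict]) -> str:
        raw = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def chat(self, model: str, messages: list[dict]) -> str:
        key = self._key(model, messages)
        if self.mode == "replay":
            return self.tape[key]  # a KeyError here means the replay diverged
        response = self.client.chat.completions.create(model=model, messages=messages)
        text = response.choices[0].message.content
        self.tape[key] = text
        return text
```

Persist the tape alongside the trace, and last Tuesday's failure becomes a run you can step through on Wednesday.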

Provider Lock-In Anatomy: The Seven Coupling Points That Make Switching LLM Providers a 6-Month Project

· 10 min read
Tian Pan
Software Engineer

Every team that ships an LLM-powered feature eventually has the same conversation: "What if we need to switch providers?" The standard answer — "we'll just swap the API key" — reveals a dangerous misunderstanding of where coupling actually lives. In practice, teams that attempt a provider migration discover that the API endpoint is the least of their problems. The real lock-in hides in seven distinct coupling points, each capable of turning a "quick swap" into a quarter-long project.

Migration expenses routinely consume 20–50% of original development time. Enterprise teams who treat model switching as plug-and-play grapple with broken outputs, ballooning token costs, and shifts in reasoning quality that take weeks to diagnose. Understanding where these coupling points are — before you need to migrate — is the difference between a controlled transition and an emergency scramble.

The Observability Tax: When Monitoring Your AI Costs More Than Running It

· 8 min read
Tian Pan
Software Engineer

Your team ships an AI-powered customer support bot. It works. Users are happy. Then the monthly bill arrives, and you discover that the infrastructure watching your LLM calls costs more than the LLM calls themselves.

This isn't a hypothetical. Teams are reporting that adding AI workload monitoring to their existing Datadog or New Relic setup increases their observability bill by 40–200%. Meanwhile, inference costs keep dropping — GPT-4-class performance now runs at $0.40 per million tokens, down from $20 in late 2022. The monitoring stack hasn't gotten that memo.

The result is an inversion that would be funny if it weren't expensive: you're paying more to watch your AI think than to make it think.
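
The inversion survives a back-of-envelope check. The inference price below is from this post; the per-request monitoring cost is an assumption to replace with your own vendor's pricing:

```python
# Back-of-envelope check for the inversion, with illustrative numbers.
TOKENS_PER_REQUEST = 2_000
INFERENCE_PER_MTOK = 0.40       # $/1M tokens, GPT-4-class, per the post
MONITORING_PER_REQUEST = 0.002  # assumed: spans + indexed logs + eval sampling

inference_cost = TOKENS_PER_REQUEST / 1_000_000 * INFERENCE_PER_MTOK
print(f"inference:  ${inference_cost:.5f}/request")          # $0.00080
print(f"monitoring: ${MONITORING_PER_REQUEST:.5f}/request")  # $0.00200
print(f"watching costs {MONITORING_PER_REQUEST / inference_cost:.1f}x thinking")
```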