
578 posts tagged with "insider"


Your On-Call Rotation Needs an AI-Literacy Prerequisite Before It Pages Anyone at 2am

· 12 min read
Tian Pan
Software Engineer

A platform engineer with eight years of incident-response experience opens a 2am page that says "AI assistant degraded — error rate 12%." She checks the model latency dashboard: green. She checks the model API status page: green. She checks the deploy log: nothing shipped in the last 72 hours. She does what any competent on-call does next — she pages the AI team. The AI engineer wakes up, opens the trace dashboard the platform engineer didn't know existed, sees that a single retrieval tool has been timing out for the last four hours because a downstream search index lost a replica, and resolves the incident in eleven minutes. The AI engineer goes back to bed at 3:14am. The retrospective the next morning records "AI feature outage, resolved by AI team." Nobody writes down the actual lesson, which is that the on-call engineer could have triaged this in five minutes if she had ever been taught what an AI feature's failure surface looks like.

This is the rotation tax that AI features quietly impose on every engineering org I've worked with in the last two years. The shared on-call rotation that worked beautifully for a stack of stateless services and a few databases breaks down the moment one of those "services" is an LLM-backed feature. The on-call playbook your SRE team built across a decade of post-mortems is calibrated for a world where "something is broken" decomposes into CPU, memory, network, deploys, and dependency timeouts. AI features add three more axes — the model, the prompt, the retrieval pipeline — and four more shapes of failure that don't show up on the dashboards your on-call was trained to read.
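
To make that concrete: the five-minute triage is mostly a per-tool breakdown that model-level dashboards don't show. A minimal sketch, assuming your tracing backend can export spans as dicts; the `tool` and `status` field names are hypothetical:

```python
from collections import Counter

def tool_failure_surface(spans: list[dict]) -> dict[str, float]:
    """Group trace spans by tool and compute per-tool error rates.

    `spans` is assumed to look like
    {"tool": "search_index", "status": "timeout", "latency_ms": 4000},
    pulled from whatever tracing backend the AI team uses.
    """
    calls, errors = Counter(), Counter()
    for span in spans:
        calls[span["tool"]] += 1
        if span["status"] != "ok":
            errors[span["tool"]] += 1
    # A single tool with a high error rate points at a dependency,
    # not at the model -- the five-minute triage from the anecdote.
    return {tool: errors[tool] / calls[tool] for tool in calls}
```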

On-Device AI Needs a Fleet Manager, Not a Model Card

· 12 min read
Tian Pan
Software Engineer

The on-device AI demo that shipped last quarter ran a single 4-bit Llama variant, ran it on a single test phone, and ran it well. Six months later, the same feature has a one-star tail of reviews complaining about heat, battery drain, or — worse — silent quality degradation that users only notice as "the AI got dumber on my old phone." The model didn't change. The fleet did. And the team that thought it was shipping a model has discovered, late, that it was actually shipping a fleet.

This is the gap that sinks most on-device AI launches: the strategy is built around picking the model, when the actual hard problem is delivering the right model to each device class, observing whether it's working, and rolling it back when it isn't. The discipline that closes that gap looks far more like CDN operations than like ML research — manifest-driven delivery, per-cohort telemetry, decoupled rollout channels, and a model-variant pipeline that produces N quantization tiers from one trained checkpoint. Most teams don't have any of that. They have a model card and a build artifact.
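
For a feel of what manifest-driven delivery means in practice, here is a deliberately small sketch. The manifest shape, field names, and tier labels are all illustrative, not any particular vendor's format:

```python
# Hypothetical manifest: one trained checkpoint, N quantization tiers,
# each gated on a device-class requirement and a rollout channel.
MANIFEST = {
    "model": "assistant-v7",
    "variants": [
        {"tier": "int4", "min_ram_gb": 8, "channel": "stable"},
        {"tier": "int8", "min_ram_gb": 6, "channel": "stable"},
        {"tier": "int4-small", "min_ram_gb": 4, "channel": "beta"},
    ],
}

def resolve_variant(device_ram_gb: float, channel: str) -> str | None:
    """Pick the largest variant the device can hold on its channel.

    Returning None means "serve no on-device model", which is itself
    a rollout decision the fleet manager has to be able to make.
    """
    eligible = [
        v for v in MANIFEST["variants"]
        if device_ram_gb >= v["min_ram_gb"] and v["channel"] == channel
    ]
    eligible.sort(key=lambda v: v["min_ram_gb"], reverse=True)
    return eligible[0]["tier"] if eligible else None
```

The point of the manifest is that rollback becomes a data change on a server, not a new app build: repoint a cohort's entry and the fleet converges.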

Per-Vector Version Tags: The Missing Column Behind Every Embedding Migration

· 10 min read
Tian Pan
Software Engineer

A new embedding model lands. The benchmark numbers are 4% better. A staff engineer files the ticket: "Upgrade embeddings to v3." Two weeks later the index has been re-embedded, the alias has been swapped, and the team has shipped the change behind a feature flag. Six weeks later, support tickets pile up. Search results "feel off." A retro is scheduled. Nobody can explain what regressed because nothing crashed and every dashboard is green.

The problem is not the model swap. The problem is that the vector store has no idea which vectors came from which model. There is no column for it. There is no migration table tracking which records have been backfilled. There is no alembic_version row, no schema_migrations table, no pg_dump of the previous state. The team treated an embedding upgrade like a config flip, and the vector store had no schema-level concept that would have stopped them.

Embedding migrations need the same artifact that database migrations have relied on for two decades: a per-record version tag, written into every vector, queried on every read, and used as the gating criterion for cutover and rollback. It is the single column most teams forget to add, and adding it later costs more than adding it up front.
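
A minimal sketch of what that column buys you, written against a hypothetical vector-store client; `store.upsert`, `store.search`, and `store.count` are stand-ins for whatever your store actually exposes:

```python
EMBED_VERSION = "v3"  # bumped with every embedding-model change

def upsert_record(store, record_id: str, vector: list[float], payload: dict):
    # The one column most teams forget: stamp every vector with the
    # model version that produced it.
    payload = {**payload, "embed_version": EMBED_VERSION}
    store.upsert(id=record_id, vector=vector, payload=payload)

def search(store, query_vector: list[float], top_k: int = 10):
    # Query only vectors from the current model; mixing versions in one
    # similarity space is the silent regression from the anecdote.
    return store.search(
        vector=query_vector,
        filter={"embed_version": EMBED_VERSION},
        top_k=top_k,
    )

def migration_progress(store) -> float:
    # Cutover gate: swap the alias only when backfill coverage hits 100%,
    # and roll back by flipping EMBED_VERSION, not by re-embedding.
    total = store.count()
    migrated = store.count(filter={"embed_version": EMBED_VERSION})
    return migrated / total if total else 1.0
```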

Prompt Deprecation Contracts: Why a Wording Cleanup Is a Breaking Change

· 9 min read
Tian Pan
Software Engineer

A four-word edit on a system prompt — "respond using clean JSON" replacing "output strictly valid JSON" — once produced no eval movement, shipped on a Thursday, and was rolled back at 4am Friday after structured-output error rates went from 0.3% to 11%. The prompt did not get worse. It got different, and the parsers downstream of it had been pinned, without anyone noticing, to the literal phrase "strictly valid."

This is the failure mode that most prompt-engineering teams have not yet built tooling for: the prompt was treated as text the author owned, when it was in fact a contract with consumers the author never met. Some of those consumers are other prompts that quote the original verbatim. Some are tool descriptions whose JSON schema fields anchor on a particular adjective. Some are evals whose rubrics ask the judge to check for "the strictly valid format." And some are parsers — the most brittle category — whose regexes were calibrated to the exact preamble the model used to emit.

A "small wording cleanup" silently breaks parsers, shifts judge calibration, and invalidates weeks of eval runs. None of these failures show up on the PR. All of them show up on the dashboard a week later as drift.
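
One cheap discipline that would have caught the 4am rollback: let consumers register the exact phrases they anchor on, and fail CI when a prompt edit removes one. A sketch, with invented consumer names and phrases:

```python
import hashlib

# Hypothetical consumer registry: each downstream artifact declares the
# exact phrases it is anchored to, so a "wording cleanup" fails CI
# instead of failing in production at 4am Friday.
CONSUMERS = {
    "refund_parser.py": ["strictly valid"],
    "judge_rubric.yaml": ["the strictly valid format"],
}

def check_prompt_contract(prompt_text: str) -> list[str]:
    """Return the consumers whose anchored phrases the new prompt breaks."""
    broken = []
    for consumer, phrases in CONSUMERS.items():
        for phrase in phrases:
            if phrase not in prompt_text:
                broken.append(f"{consumer}: anchored phrase {phrase!r} missing")
    return broken

def prompt_version(prompt_text: str) -> str:
    # A content hash makes "the prompt changed" a first-class, diffable
    # event rather than an invisible edit to a string literal.
    return hashlib.sha256(prompt_text.encode()).hexdigest()[:12]
```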

The Customer Record Hiding in Your Few-Shot Prompt Template

· 11 min read
Tian Pan
Software Engineer

The privacy auditor's question came two days before the SOC 2 renewal: "Why is the email field in your onboarding prompt's example a real customer address?" The product team rebuilt the chain in their heads. A year earlier, when they shipped the AI summarizer, someone needed a "see how this works" example for the few-shot template. They picked a representative customer record from staging, scrubbed the obvious fields — name, account ID, phone — and committed the file. The customer churned six months later. Their record was deleted from the database per the data retention policy. Their record was not deleted from the prompt template, which had been shipped to every tenant in production.

The team had assumed, like most teams, that the privacy boundary was the database. The prompt template was code. Code goes through review. Review doesn't flag PII because reviewers aren't looking for it in YAML strings labeled example_input:. The DLP scanner that catches PII in Slack messages and email attachments doesn't scan committed code, and even if it did, it wouldn't recognize a partially-scrubbed customer record as personal data because the fields it knew to look for had been removed. Everything that remained — the company size, the industry, the rare job title, the specific city — was data the scanner had no rule for.
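
Even a crude CI heuristic beats a scanner with no rule at all. A sketch that flags email-shaped strings inside few-shot example keys in committed templates; the key names and file layout are assumptions about your repo, not a real DLP product:

```python
import re
from pathlib import Path

# A deliberately narrow heuristic: real DLP is harder, but even a CI
# grep over committed templates would have caught the live email.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
EXAMPLE_KEYS = ("example_input", "example_output", "few_shot")

def scan_templates(template_dir: str) -> list[str]:
    findings = []
    for path in Path(template_dir).rglob("*.yaml"):
        text = path.read_text()
        if any(key in text for key in EXAMPLE_KEYS) and EMAIL.search(text):
            findings.append(f"{path}: email-shaped string in a few-shot example")
    return findings
```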

The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation

· 12 min read
Tian Pan
Software Engineer

The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.

Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.

The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.
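
The fix starts with running the same suite on a wall-clock schedule and reporting per-hour distributions instead of one number. A sketch of the aggregation, assuming each run is recorded as a dict with hypothetical `hour`, `latency_ms`, and `judge_score` fields:

```python
from statistics import median, quantiles

def by_hour_bucket(runs: list[dict]) -> dict[int, dict]:
    """Aggregate eval runs by wall-clock hour.

    `runs` is assumed to look like
    {"hour": 8, "latency_ms": 1800, "judge_score": 3.5},
    one entry per prompt per scheduled eval pass.
    """
    buckets: dict[int, list[dict]] = {}
    for run in runs:
        buckets.setdefault(run["hour"], []).append(run)
    report = {}
    for hour, rs in buckets.items():
        lat = [r["latency_ms"] for r in rs]  # needs >= 2 samples per bucket
        report[hour] = {
            "p95_latency_ms": quantiles(lat, n=20)[-1],  # 95th percentile
            "median_judge_score": median(r["judge_score"] for r in rs),
            "n": len(rs),
        }
    return report
```

A 2 AM-only eval is one bucket of this report; the ship decision should look at the worst bucket your users actually live in.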

The Structured-Output Retry Loop Is Your Hidden Compute Waste

· 11 min read
Tian Pan
Software Engineer

Pull up your structured-output dashboard. The number it proudly shows is something like "98.4% schema compliance." That's the success rate — the fraction of requests that produced a valid JSON object on the first try. The team built a retry wrapper for the other 1.6%, shipped it, and moved on. Two quarters later, the inference bill is up 15% on a request volume that grew by 4%. The CFO wants a story. The engineers don't have one, because the dashboard that tracks structured-output success doesn't track structured-output cost.

Here's the part the dashboard is hiding: the failure path is not a single retry. The first re-prompt fixes the missing enum field but introduces a malformed nested array. The second re-prompt fixes the array but drops a required key. The third pass finally validates, but by then the request has burned three repair calls on top of the original generation, four full inference calls in total, and your per-request token meter shows the sum, not the loop. From the meter's perspective it's one expensive request. From the cost line's perspective it's a stochastic loop you never priced.

This post is about what that loop actually does to your compute budget, why your existing observability can't see it, and the disciplines that make it visible and bounded.
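
As a preview of the instrumentation argument, here is a retry wrapper that returns the loop's cost alongside the result. `call_model` and `validate` are stand-ins for your inference client and schema validator:

```python
import json

def generate_validated(call_model, prompt: str, validate, max_attempts: int = 4):
    """Retry wrapper that makes the loop's cost observable.

    `call_model(prompt)` is assumed to return (text, token_count);
    `validate(obj)` raises on schema violations. The point is the
    returned attempts/tokens record, which most wrappers throw away.
    """
    total_tokens = 0
    for attempt in range(1, max_attempts + 1):
        text, tokens = call_model(prompt)
        total_tokens += tokens
        try:
            obj = json.loads(text)
            validate(obj)
            # Emit the loop cost, not just the success bit.
            return obj, {"attempts": attempt, "total_tokens": total_tokens}
        except Exception as err:
            prompt = f"{prompt}\n\nYour previous output failed: {err}. Fix it."
    raise RuntimeError(f"gave up after {max_attempts} calls, {total_tokens} tokens")
```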

Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute

· 11 min read
Tian Pan
Software Engineer

Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.

The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. The watts per token shift by 30× depending on which model tier served the request, by 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers — accurate in the most useless sense.
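
Computing the metric per serving cell rather than per fleet is the whole move. A sketch, using joules as the energy unit; the per-request event fields are hypothetical telemetry, not anything your provider exports today:

```python
from collections import defaultdict

def tokens_per_joule(events: list[dict]) -> dict[tuple, float]:
    """Compute the metric per serving cell instead of one fleet number.

    `events` are hypothetical per-request records like
    {"tier": "small", "cache": "hit", "tokens": 512, "joules": 40.0}.
    The keying is the whole point: a single aggregate hides the 30x
    spread between model tiers and the cache hit/miss gap.
    """
    tok = defaultdict(float)
    joules = defaultdict(float)
    for e in events:
        key = (e["tier"], e["cache"])
        tok[key] += e["tokens"]
        joules[key] += e["joules"]
    return {key: tok[key] / joules[key] for key in tok if joules[key] > 0}
```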

Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists

· 11 min read
Tian Pan
Software Engineer

The agent took four hundred milliseconds to answer a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.

This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function which calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a recursion the framework can't see.
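
The missing primitive is small. A sketch of an explicit per-request tool stack that turns a cycle into an error at depth two instead of a crash at the recursion limit:

```python
class ReentrancyGuard:
    """Explicit tool-call stack for a dispatcher that otherwise has none.

    Before dispatching, push the tool onto a per-request stack; a tool
    already on the stack is a cycle, caught immediately and reported
    with the full path instead of a bare recursion-limit error.
    """

    def __init__(self, max_depth: int = 8):
        self.stack: list[str] = []
        self.max_depth = max_depth

    def enter(self, tool_name: str):
        if tool_name in self.stack:
            cycle = " -> ".join(self.stack + [tool_name])
            raise RuntimeError(f"reentrant tool call: {cycle}")
        if len(self.stack) >= self.max_depth:
            raise RuntimeError(f"tool depth {self.max_depth} exceeded")
        self.stack.append(tool_name)

    def exit(self, tool_name: str):
        assert self.stack and self.stack[-1] == tool_name
        self.stack.pop()
```

A dispatcher would call `enter` before invoking the tool and `exit` in a `finally` block, and write the current stack into every trace span so the UI can render nesting instead of a flat sibling list.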

Tool Schemas Are Prompts, Not API Contracts

· 11 min read
Tian Pan
Software Engineer

The most expensive line in your agent codebase is the one that auto-generates tool schemas from your existing OpenAPI spec. It looks like a clean engineering choice — single source of truth, no duplication, auto-sync on every API change. It is also why your agent picks searchUsersV2 when it should have picked searchUsersV3, fills limit=20 because your spec's example said so, and silently drops the tenant_id because it was buried in the seventh parameter slot.

Nothing about this shows up in unit tests. The schema validates. The endpoint exists. The agent's call is well-formed JSON. And yet the model uses the tool wrong, every time, in ways your QA pipeline never sees because it tests the API, not the agent's reading of the API.

The bug is conceptual. OpenAPI was designed to describe APIs to humans who write SDK code; tool schemas are read by an LLM at every single call as a piece of the prompt. Treating them as the same artifact is the same category mistake as auto-generating user-facing copy from your database column names.
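
The difference is easiest to see side by side. Both schemas below are invented for illustration; the point is what each one tells a model reader:

```python
# Auto-generated from the OpenAPI spec: technically accurate, useless
# as a prompt. The model has to guess between V2 and V3, and the spec's
# example value becomes the model's default.
AUTOGEN = {
    "name": "searchUsersV3",
    "description": "Search users. Returns a paginated list.",
    "parameters": {"limit": {"type": "integer", "example": 20}},
}

# Hand-written for a model reader: says when to use it, what the
# defaults mean, and makes the required field impossible to miss.
CURATED = {
    "name": "search_users",
    "description": (
        "Find users by name or email. Always use this, never the legacy "
        "v2 endpoint. Pass tenant_id on every call; omitting it searches "
        "nothing and returns an empty list."
    ),
    "parameters": {
        "tenant_id": {"type": "string", "description": "Required. Caller's tenant."},
        "limit": {"type": "integer", "description": "Max results; default 10 is fine."},
    },
}
```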

Translation Is Not Localization: The Cultural-Calibration Debt Your Multilingual AI Just Defaulted On

· 12 min read
Tian Pan
Software Engineer

A multilingual launch that ships English prompts translated into N languages, with an English eval set translated into the same N languages, has not shipped a multilingual product. It has shipped one product N times, and made all the failure modes invisible to its own dashboards. The system is fluent and culturally off-key, and the metric the team optimized — translation quality — is the wrong axis to measure what users are reacting to.

The visible defect on launch day is small. A Japanese user receives a reply that is grammatically correct and conspicuously curt. An Indonesian user notices the assistant is cheerfully direct in a register that reads as rude. A Korean user gets advice framed around individual choice when the prompt was about a family decision. None of these are translation bugs. They are cultural-register bugs that translation cannot fix and translated evals cannot detect.
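
One way to see the gap: localization carries register as a per-locale parameter, where translation carries only words. A sketch with invented field names and deliberately coarse register values:

```python
# Localization as per-locale behavior parameters, not per-locale
# translations of one English prompt. All fields are illustrative.
LOCALE_REGISTER = {
    "en-US": {"formality": "casual", "directness": "high",
              "decision_frame": "individual"},
    "ja-JP": {"formality": "polite", "directness": "low",
              "decision_frame": "group"},
    "ko-KR": {"formality": "polite", "directness": "medium",
              "decision_frame": "family"},
}

def system_prompt(locale: str) -> str:
    r = LOCALE_REGISTER[locale]
    # The register travels with the locale; a translated English prompt
    # would hard-code "casual, direct, individual" into every language.
    return (
        f"Use a {r['formality']} register with {r['directness']} directness. "
        f"Frame advice around {r['decision_frame']} decision-making."
    )
```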

The Two-PM Problem: When Prompt Ownership and Product Ownership Drift Apart

· 11 min read
Tian Pan
Software Engineer

A support ticket lands on Tuesday morning: a customer was given a confidently wrong answer about their refund window. Engineering pulls the trace and finds the model picked the wrong intent. The product PM looks at the dashboard and sees the new "express refund" affordance — shipped last sprint — surfaced an intent the prompt was never tuned to handle. The platform PM points at the eval suite, which is green. Both are technically right. The customer is still wrong.

This is the two-PM problem, and most AI teams have it without naming it. The product PM owns the user-facing surface — intents, success metrics, the support escalation path. The platform or ML PM owns the prompt, the model choice, the eval suite, and the cost ceiling. The roadmaps are coordinated at the quarterly-planning level and drift at the weekly-shipping level, because the two PMs are optimizing for different metrics on different dashboards with different change-control processes.

The interesting failure mode isn't that the two PMs disagree. It's that they ship correctly relative to their own scope and still produce a regression nobody owns.