678 posts tagged with "ai-engineering"

Your Embedding Model Choice Sets the Ceiling Your LLM Can't Raise

· 11 min read
Tian Pan
Software Engineer

A team I was advising had spent two months swapping LLMs in their RAG pipeline. Claude, GPT, Gemini, then back again. Each swap shaved a few percentage points off the hallucination rate but never moved the needle on the metric that mattered: their support agents still couldn't find the right knowledge base article more than 60% of the time. They were tuning the wrong layer. The retriever was returning irrelevant chunks, and no amount of LLM cleverness can answer a question from documents the retriever never surfaced.

The embedding model is the part of a RAG system that decides what the LLM is even allowed to see. It draws the geometry of your corpus — which documents land near which queries in vector space. Once that geometry is wrong, the LLM is just a confident narrator of bad context. Swapping it for a smarter one usually makes the answers more articulate, not more correct.
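One way to check whether the retriever is the bottleneck is to score it alone, with no LLM in the loop. The sketch below is a minimal hit-rate-at-k calculation. The function name, the query ids, and the labeled sets of relevant article ids are hypothetical, not taken from the team's pipeline.

```python
from typing import Dict, List, Set


def hit_rate_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of labeled queries whose top-k results contain a relevant id.

    retrieved: query id -> ranked document ids returned by the retriever
    relevant:  query id -> ids a human marked as correct (hypothetical labels)
    """
    if not relevant:
        raise ValueError("need at least one labeled query")
    hits = 0
    for query_id, gold_ids in relevant.items():
        top_k = retrieved.get(query_id, [])[:k]
        if gold_ids.intersection(top_k):
            hits += 1
    return hits / len(relevant)


if __name__ == "__main__":
    # Toy data: q1 finds its article at rank 2, q2 never finds it.
    retrieved = {"q1": ["kb-9", "kb-4"], "q2": ["kb-1", "kb-2"]}
    relevant = {"q1": {"kb-4"}, "q2": {"kb-7"}}
    print(hit_rate_at_k(retrieved, relevant, k=2))
```

If this number stays flat while you swap LLMs, the ceiling is probably being set upstream of the model.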

Eval Set Rot: Why Your Score Trends Up While Users Trend Down

· 10 min read
Tian Pan
Software Engineer

The eval score has been trending up for two quarters. The dashboard is green, the regression suite has not flagged a real failure since March, and the team has gotten faster at shipping prompt changes because the eval gives crisp pass/fail answers. Meanwhile, user-reported quality is sliding. NPS is down four points, the support queue is full of failure modes nobody has labels for, and the head of product has started asking why the evals look great if customers are angry.

The eval set is not lying. It is answering the question it was built to answer, six months ago, against the traffic distribution that existed in launch week. The product has shifted. The user base has shifted. The long-tail use cases the team did not anticipate at launch now make up a third of traffic. The eval set is still measuring the world that existed in week one, and the team is averaging today's model against yesterday's product.

This is eval set rot. It is one of the quietest failure modes in modern AI engineering, and it gets worse as the eval set gets bigger, because the people maintaining it confuse "more cases" with "better coverage."
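A rough way to see the rot is to compare the category mix of the eval set with the category mix of recent traffic. The sketch below assumes every eval case and every sampled production request already carries a category label; the label names and both helper functions are illustrative, not a standard tool.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def category_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Turn a list of category labels into per-category fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels supplied")
    return {category: n / total for category, n in counts.items()}


def coverage_gap(
    eval_labels: Iterable[str], traffic_labels: Iterable[str]
) -> Tuple[float, List[str]]:
    """Return (total variation distance, traffic categories absent from the eval set)."""
    eval_shares = category_shares(eval_labels)
    traffic_shares = category_shares(traffic_labels)
    categories = set(eval_shares) | set(traffic_shares)
    distance = 0.5 * sum(
        abs(eval_shares.get(c, 0.0) - traffic_shares.get(c, 0.0)) for c in categories
    )
    missing = sorted(c for c in traffic_shares if c not in eval_shares)
    return distance, missing


if __name__ == "__main__":
    eval_set = ["billing"] * 8 + ["login"] * 2
    last_week = ["billing"] * 5 + ["login"] * 2 + ["bulk_export"] * 3
    print(coverage_gap(eval_set, last_week))
```

A distance near 0 means the eval set still resembles traffic. A growing distance, or a non-empty list of missing categories, suggests the score is averaging over a product that no longer exists. Adding more cases in the already-covered categories leaves both numbers roughly where they were.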

Why Your Prompt Library Should Be a Monorepo, Not a Cookbook

· 11 min read
Tian Pan
Software Engineer

A team I worked with recently had three different "summarize this contract" prompts. One lived in a Notion page that the legal-tech squad copy-pasted into their service. One lived in a prompts/ folder in the customer-success backend, slightly modified to handle their tone preferences. One lived inline in a Python file inside the data team's notebook, hardcoded between two f-string interpolations. When OpenAI deprecated the model they all ran on, the migration plan involved Slack archaeology — each owner had to be tracked down, each variant had to be re-evaluated, and two of the three subtly broke in production for a week before anyone noticed.

This is what a prompt cookbook looks like at scale. Cookbooks make sense for ten prompts and one team. They become unmanageable somewhere around a hundred prompts and four teams. By the time you're running an AI organization, your prompts/ folder of .md files behaves exactly like vendored copy-paste code from 2008: every consumer has its own snapshot, drift is invisible, and breaking changes ripple outward in unpredictable ways.

Agent Disaster Recovery: When Working Memory Dies With the Region

· 12 min read
Tian Pan
Software Engineer

The DR runbook your team rehearses every quarter was written for a stack you no longer fully run. It says: promote the replica, repoint DNS, drain the queue. It assumes state lives in databases, queues, and object storage — places the SRE org has owned, named, and tested for a decade. Then last quarter you shipped an agent. Working memory now lives in the inference provider's session cache, scratchpad files on a worker's local disk, in-flight tool results that haven't been written back, and a partial plan-and-act trace that exists only in the prompt history of one model call. None of that is on the asset register. None of it is in the runbook.

When the region drops, the agent doesn't fail cleanly. It half-completes. The user sees a workflow that started but that the failover region cannot resume. The customer's invoice gets sent twice or not at all because the idempotency key lived on the dead worker. And the on-call engineer reads a Slack thread that begins "the orchestrator is up, but..." and ends six hours later with a credit-card chargeback queue.

This is the gap nobody named: agentic features have a state model the existing DR plan doesn't describe. The team that hasn't written that state surface down is one regional outage away from learning what their runbook's silence costs.
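One small piece of that state surface can be made region-independent: the idempotency key. The sketch below derives the key from durable workflow facts instead of generating it on the worker, so a failover region recomputes the same value. The field names and the function are hypothetical, and it assumes the workflow id and step name are themselves stored somewhere that survives the outage.

```python
import hashlib


def idempotency_key(workflow_id: str, step_name: str, payload_fingerprint: str) -> str:
    """Derive a stable key from durable inputs (hypothetical scheme).

    Any region that can read the workflow record can recompute this key,
    so nothing has to be recovered from the dead worker's local disk.
    """
    # Unit separator keeps ("ab", "c") distinct from ("a", "bc").
    material = "\x1f".join([workflow_id, step_name, payload_fingerprint])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    primary = idempotency_key("wf-1042", "send_invoice", "invoice-88:v1")
    failover = idempotency_key("wf-1042", "send_invoice", "invoice-88:v1")
    print(primary == failover)
```

This only helps if the downstream system (here, whatever sends the invoice) actually deduplicates on the key; that is an assumption about the integration, not something the sketch enforces.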

Agent Incident Forensics: Capture Before You Need It

· 11 min read
Tian Pan
Software Engineer

The customer sends a screenshot to support on a Tuesday. Their account shows a refund posted six days ago that they never asked for. Your CRO forwards the screenshot with one question: "What produced this?" You know an agent did it — the audit log says actor: refund-agent-v3. But the prompt has been edited four times since. The model id rotated last Thursday when finance switched providers to chase a 12% cost cut. The system prompt is templated from three retrieved documents, and the retrieval index was reindexed Monday. The conversation history was trimmed by the runtime to fit a smaller context window.

You can tell the CRO the agent did it. You cannot tell them why. That gap — between knowing an action happened and being able to reconstruct the inputs that caused it — is the gap most agent teams discover the first time someone outside engineering asks a real forensic question.
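A minimal sketch of what "capture before you need it" could look like: at the moment the agent acts, write down the inputs that were actually used, not pointers to things that will be edited later. Every field name here is an assumption chosen to mirror the story above (rendered prompt, model id, retrieval index version, trimmed history), and the storage layer is left out.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ActionRecord:
    """Snapshot of the inputs behind one agent action (hypothetical schema)."""
    action_id: str
    actor: str
    model_id: str                   # the resolved model id, not an alias
    rendered_prompt: str            # the prompt after templating, as sent
    rendered_prompt_sha256: str     # lets you verify the stored text later
    retrieval_index_version: str
    retrieved_doc_ids: Tuple[str, ...]
    history_truncated: bool         # True if the runtime trimmed the conversation


def capture_action(
    action_id: str,
    actor: str,
    model_id: str,
    rendered_prompt: str,
    retrieval_index_version: str,
    retrieved_doc_ids: Sequence[str],
    history_truncated: bool,
) -> ActionRecord:
    digest = hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()
    return ActionRecord(
        action_id=action_id,
        actor=actor,
        model_id=model_id,
        rendered_prompt=rendered_prompt,
        rendered_prompt_sha256=digest,
        retrieval_index_version=retrieval_index_version,
        retrieved_doc_ids=tuple(retrieved_doc_ids),
        history_truncated=history_truncated,
    )


def to_json_line(record: ActionRecord) -> str:
    """Serialize one record for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

The design choice worth noting is that the record stores resolved values. A prompt template name or a model alias would point at something that has since changed, which is exactly the gap described above.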

Your Agent Release Notes List Files. Your Integrators Need Behavior Diffs.

· 13 min read
Tian Pan
Software Engineer

A platform team ships their weekly agent release on a Wednesday afternoon. The internal changelog is dutiful: three system-prompt commits, a model-alias bump from a -0815 snapshot to -1019, four edits to tool descriptions, a new eval-rubric weighting, and a refreshed retriever index. By Friday, the support queue has eighteen tickets that nobody on the platform team can pattern-match. Tickets two and seven say "the bot is suddenly refusing to summarize private repos." Ticket eleven says "every code block in the output now starts with a language tag, and our downstream parser breaks on it." Ticket fifteen says "tool X is being called twice as often on long inputs and we're hitting our rate limit."

None of these tickets reference any of the lines in the changelog. The platform team's release notes are a list of files moved. The integrator tickets are a list of behaviors changed. The two documents do not meet in the middle, and that gap is where the trust leaks out.

Your Prompts Ship Like Cowboys: Why Code Review Discipline Doesn't Extend to AI Artifacts

· 11 min read
Tian Pan
Software Engineer

Walk through any mature engineering team's PR queue and you will see the same thing: a four-line bug fix attracts three rounds of comments about naming, error handling, and missing test coverage, while a forty-line edit to the system prompt sails through with a single "LGTM, ship it." The author shrugged because the diff looks like documentation. The reviewer shrugged because they have no mental model of what "good" looks like inside that block of English. The result is a prompt change with the blast radius of a feature launch, reviewed at the bar of a typo fix.

This is the quiet quality crisis of every team building with LLMs in production. The codebase has decades of accumulated discipline: linters, type checks, code owners, test gates, deploy windows. The artifacts that actually steer the model (the system prompt, the eval rubric, the tool description, the few-shot exemplars) sit in the same repo but ship through a review process that treats them as English prose. So prompt regressions, eval-rubric drift, and tool-schema breakages land at a quality bar the team would never accept for code.

The Demo Was a Single Seed: Why Your AI Rollout Is a Variance Problem, Not a Polish Problem

· 11 min read
Tian Pan
Software Engineer

The exec demo went perfectly. The model answered the curated question, the agent completed the workflow, the screen recording is saved on the company drive, and the launch date is now in the calendar. Six weeks later the rollout craters and the post-mortem narrative writes itself: the model needed more polish, the prompt needed more iteration, the team underestimated the work between prototype and production.

That narrative is wrong, and it's expensive, because it sends the team back to do more of the work that already failed. The demo wasn't an under-polished version of production. It was a single sample from a distribution the team never measured. The wow moment was one realization out of thousands the model would generate against the same input, and the team shipped the best one as if it were the typical one. The gap between demo and prod isn't quality drift. It's variance the team hadn't yet seen.

This reframing matters because the fix for a variance problem looks nothing like the fix for a polish problem. Polish says "iterate the prompt, tune the model, hire a better PM." Variance says "you don't know what you have until you sample it n times across the input distribution." The two diagnoses produce different roadmaps, different budgets, and different incident patterns. The teams that ship reliably in 2026 know which problem they have.
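To make "sample it n times" concrete, here is a small sketch that runs the same input repeatedly and reports a pass rate with a Wilson score interval. The `run_once` callable is a stand-in for whatever invokes the model and grades the result; it is an assumption, as is the 95% default.

```python
import math
from typing import Callable, Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def pass_rate(run_once: Callable[[], bool], n: int) -> Tuple[float, Tuple[float, float]]:
    """Call run_once n times; return the observed rate and its interval."""
    successes = sum(1 for _ in range(n) if run_once())
    return successes / n, wilson_interval(successes, n)


if __name__ == "__main__":
    # A demo is n=1 with one success: the interval is very wide.
    print(wilson_interval(1, 1))
    # The same perfect record over 200 samples is far more informative.
    print(wilson_interval(200, 200))
```

With a single successful run, the lower bound of the interval sits at roughly 0.21. In other words, one perfect demo is statistically compatible with a feature that fails most of the time.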

The Hidden Edges Between Your AI Features: When One Prompt Edit Regresses Three Other Teams

· 9 min read
Tian Pan
Software Engineer

A platform engineer changes the opening sentence of the company's "house style" preamble — a single line that anchors voice across customer-facing assistants. The change ships behind a flag. By Tuesday, the search team's relevance regression has spiked, the support bot's eval pass-rate has dropped four points, and the onboarding agent's retry rate has doubled. None of those teams touched their own code. None of them got a heads-up. The platform engineer has no idea any of this happened, because nobody was on the receiving end of an alert that said "your edit just broke three downstream features."

This is the failure mode that defines the second year of an AI org's life. The first year, every team builds its own thing in a corner. The second year, those corners start sharing artifacts — a prompt fragment here, a seeded eval set there, a tool schema reused as a contract — and the moment that sharing becomes implicit, the dependency graph between AI features becomes invisible. You now have a distributed system whose edges no one can name.

The discipline that fixes this is not a new platform. It's drawing the graph.
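Drawing the graph can start very small. The sketch below assumes each feature declares the shared artifacts it consumes, then inverts that map so an edit to one artifact can be traced to every downstream feature. The feature and artifact names are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_consumers(uses: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert feature -> artifacts into artifact -> features that depend on it."""
    consumers: Dict[str, Set[str]] = defaultdict(set)
    for feature, artifacts in uses.items():
        for artifact in artifacts:
            consumers[artifact].add(feature)
    return dict(consumers)


def affected_features(consumers: Dict[str, Set[str]], changed_artifact: str) -> List[str]:
    """Features to notify (or re-evaluate) when one shared artifact changes."""
    return sorted(consumers.get(changed_artifact, set()))


if __name__ == "__main__":
    # Hypothetical declarations, one per feature.
    uses = {
        "search-relevance": ["house-style-preamble", "seeded-eval-set"],
        "support-bot": ["house-style-preamble", "refund-tool-schema"],
        "onboarding-agent": ["house-style-preamble"],
    }
    consumers = build_consumers(uses)
    print(affected_features(consumers, "house-style-preamble"))
```

The lookup is trivial once the declarations exist. The hard part is getting teams to write them down, which is why the edges stay invisible by default.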

Your AI Explainer Doc Is a Runtime Dependency, Not Marketing Copy

· 12 min read
Tian Pan
Software Engineer

A team I worked with last quarter shipped an AI assistant with a tidy stack of supporting documents: an in-product tooltip warning that the AI may produce inaccurate results, a help-center article titled "How does the assistant work," an internal support runbook for handling escalations, and a public model card listing the underlying model, the tools the assistant could call, and the data domains it covered. The launch went well. Six months later the prompt had been edited fourteen times, the model had been swapped from one tier to another with subtly different refusal behavior, two new tools had been added, one tool had been deprecated but not removed from the prompt, and the language settings had been opened from English-only to nine locales.

Every single one of those documents was wrong. Not catastrophically wrong — the kind of wrong where a sentence is half-true, a capability is described in language the model no longer matches, a refusal pattern is documented that the new model never triggers, a tool name appears in the help article that the assistant won't actually call. The kind of wrong that produces a slow drip of confused support tickets, a few customer trust regressions when the AI does something the docs say it won't, and — because the company sells into a regulated vertical — a small but real compliance gap that nobody on the AI team had thought to track.

Your AI Feature Ramp Is Rolling Out on the Wrong Axis

· 11 min read
Tian Pan
Software Engineer

A team I talked to last month ramped a new agentic feature from 1% to 50% of users over four weeks. Aggregate quality metrics held within noise. Latency stayed within SLA. They were preparing the 100% memo when the support queue caught fire — a customer with a six-tool research workflow had been getting silently corrupted outputs since the 10% step. The hard queries had been there the whole time, evenly sprinkled across every cohort, averaging into the noise floor. Nobody saw them until a single high-volume user happened to hit them at scale.

This is not a monitoring failure. It is a ramp-axis failure. Feature flag tooling — the entire LaunchDarkly / Flagsmith / Unleash / Cloudflare-Flagship category — assumes blast radius scales with the number of humans exposed. For deterministic software that is mostly true: a NullPointerException hits everyone or nobody, and showing it to 1% of users limits the user-visible blast to 1%. For AI features, blast radius does not scale on the human axis. It scales on the input axis. And the input axis is where almost no one is ramping.
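A sketch of what ramping on the input axis might look like: classify each request into an input bucket first, then apply a separate exposure percentage per bucket. The bucketing rule (tool count), the threshold of three tools, and the percentages are all assumptions for illustration; this is not how any of the flag products named above work.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical ramp table: easy inputs are widely exposed, hard ones barely.
RAMP_PERCENT_BY_BUCKET: Dict[str, int] = {"simple": 50, "multi_tool": 5}


def input_bucket(tool_count: int) -> str:
    """Crude difficulty proxy: workflows with three or more tools count as hard."""
    return "multi_tool" if tool_count >= 3 else "simple"


def is_enabled(
    request_key: str,
    tool_count: int,
    ramp: Optional[Dict[str, int]] = None,
) -> bool:
    """Deterministically decide exposure for one request within its input bucket."""
    table = RAMP_PERCENT_BY_BUCKET if ramp is None else ramp
    bucket = input_bucket(tool_count)
    percent = table.get(bucket, 0)  # unknown buckets stay off
    digest = hashlib.sha256(f"{bucket}:{request_key}".encode("utf-8")).digest()
    slot = int.from_bytes(digest[:8], "big") % 100
    return slot < percent


if __name__ == "__main__":
    print(is_enabled("user-17:req-3", tool_count=1))
    print(is_enabled("user-17:req-3", tool_count=6))
```

Because each bucket has its own percentage, quality metrics can also be read per bucket, so the six-tool workflows stop averaging into the noise floor of the easy traffic.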

AI Office Hours Don't Scale: When Your One Expert Becomes the Release Gate

· 11 min read
Tian Pan
Software Engineer

Open the calendar of the one engineer at your company who has shipped real AI features into production for more than six months. Count the recurring "30 min sync — questions about the agent" invites, the ad-hoc "can I grab you for 15?" Slack pings that ended up booked, the architecture-review attendances marked "optional" that they actually have to be at, and the office hours block that started as one Friday afternoon and now eats two hours every weekday. Then look at the roadmap and trace which features depend on a decision that engineer hasn't made yet. The intersection is your real release schedule. The Jira board is fiction.

This is the AI office hours bottleneck, and it is the load-bearing constraint inside more 2026 AI orgs than anyone in those orgs would say out loud. The team scaled AI feature work fast (every product squad got a model budget, every PM got a prompt) and routed every "is this the right model," "should we use RAG here," "is our eval design valid," "why is the cache hit rate weird" question to the one engineer who's actually shipped enough production AI to answer. Six months in, that engineer's calendar is the rate-limiting step for half the roadmap, and "I need to grab 30 minutes with them" has become the de facto escalation path that your incident response was supposed to make explicit.