678 posts tagged with "ai-engineering"

The Agent Debugger Has No Breakpoints: Why Trace-First Workflows Replace Step-Through

· 10 min read
Tian Pan
Software Engineer

The first time you try to debug an agent the way you'd debug a service, you discover that the muscle memory has nothing to grip. You set a hypothetical breakpoint — there's no IDE pane to put it in, but you imagine one — at the step where the planner picked the wrong tool. You rerun with the same input. The planner picks the right tool this time. You rerun again. It picks a third tool you've never seen before. The bug is real, your colleague reproduced it twice this morning, and the debugger you've used for fifteen years is suddenly a museum piece.

The mental model that breaks here isn't "use a debugger." It's the much deeper assumption underneath: that a program, given the same inputs, produces the same execution. Every affordance in a modern debugger — breakpoints, step-over, watch expressions, conditional breaks, hot reload — is built on top of that determinism. You pause execution because pausing is meaningful. You step forward because the next step is knowable. You inspect a variable because its value is a fact, not a draw from a distribution.

The AI Accessibility Audit Nobody Runs

· 11 min read
Tian Pan
Software Engineer

Open your agent product, turn on VoiceOver, and hit send on any prompt. If you have a typical streaming UI with an inline reasoning trace, what you will hear in the next thirty seconds is not your product. It is a torrent of partial tokens, mid-word reflows, status changes nobody announced, and a reasoning monologue your sighted users opted into but your blind users cannot escape. The interface that demoed beautifully on stage is, to a screen reader, a denial-of-service attack delivered as speech.

This is the audit nobody on the AI team runs. The design review approved the streaming animation. The eval suite measured answer quality. The latency dashboard tracked time-to-first-token. None of those instruments noticed that the affordance making the product feel fast and thoughtful for one cohort makes it unusable for another. And that omission is starting to show up in pro-se lawsuit filings — the same federal courts that have been processing accessibility complaints against e-commerce sites for a decade are now seeing AI-interface complaints rise sharply, with one tracker reporting a 40% year-over-year increase in 2025 alone.

The AI Feature Sunset Playbook Nobody Writes

· 13 min read
Tian Pan
Software Engineer

Every AI org has a graveyard. Not of services — those get a runbook, a deprecation banner, a 30-day migration window, and a slot on the platform team's quarterly roadmap. The graveyard is of features: the smart-summary beta that never graduated, the auto-categorizer that two enterprise customers actually built workflows around, the agentic flow that demoed beautifully and shipped behind a flag that nobody flipped off. The endpoint is easy to deprecate. The four other things attached to it — the prompt, the judge, the regression set, and the incident memory — are what actually take a quarter, and nobody on the team has written the playbook because nobody has been promoted for retiring something.

This is the gap. Most of the public discourse on "model deprecation" is about vendor-side retirements: GPT-4o leaves on a date, Assistants API beta sunsets on August 26, DALL-E 3 retires on May 12, and your platform team has a notification period to migrate. That problem has playbooks because vendors publish dates, because the migration is forced, and because the work fits in a sprint. The internal version — when you decide a feature you built didn't graduate, and you have to actually take it out — has none of those forcing functions. The deprecation date is whatever you say it is. The migration path is whatever you build. And the artifacts you have to retire are not a single endpoint but a tangled stack of model-adjacent assets that your monitoring barely knows exist.

The AI Told Me So Defense: When Code Review Quietly Stops Pushing Back

· 11 min read
Tian Pan
Software Engineer

The single most expensive sentence in a 2026 code review thread is "the agent wrote it this way." Not because it's wrong — sometimes it isn't — but because it ends a conversation that used to start one. The reviewer types a question, the author quotes the model's reasoning back at them, and the thread resolves before anyone has actually argued about the change. The social cost of disagreeing with a confident, well-spoken model has quietly become higher than the cost of merging a subtle bug, and most teams won't see the trade in their metrics for another two quarters.

This is not a story about whether AI writes good code. It writes code, some of it good. This is a story about what happens to a quality gate when the friction at composition time collapses. Review velocity rises, defect rate rises in lockstep, and the correlation isn't obvious because nobody is tracking review-time-to-defect with the author class attached. The senior engineer who used to be the gravity well of taste in the codebase becomes the lone holdout in a culture quietly recalibrating around model deference.

The Composability Tax: Why Adding Tools Makes Your Planner Worse

· 9 min read
Tian Pan
Software Engineer

The team starts with five tools and a planner that hits the right one 95% of the time on production traffic. Eighteen months later they have fifty-one, the planner is sitting at 26%, and the simple cases the original five handled cleanly — book a meeting, look up a customer, file a ticket — now sometimes route to the wrong tool because there are three plausible-sounding lookalikes in the catalog. Nobody decided to make the planner worse. Every tool addition was individually defensible. The cumulative bill is the composability tax, and it is paid by every product whose tool catalog grows without a retirement discipline.

The tax is a curve, not a cliff. The Berkeley Function Calling Leaderboard measured it directly: on calendar scheduling, accuracy fell from 43% with four tools to 2% with fifty-one across multiple domains. On customer-support style tasks, GPT-4o dropped from 58% (single domain, nine tools) to 26% (seven domains, fifty-one tools). Llama-3.3-70B went from 21% to 0% over the same expansion. The shape repeats across models and task types: every additional tool moves the planner down the curve, and the marginal damage gets worse as the catalog gets larger because new entries are increasingly indistinguishable from incumbents.

The Customer-Facing AI Postmortem When Nothing Crashed

· 12 min read
Tian Pan
Software Engineer

Your status page is green. Your error rate is zero. Your uptime dashboard reads 100% for the seventh consecutive month. And yet at 9:14 AM on a Tuesday, your account team is forwarding you a message from a Fortune 500 customer that says, "Our team noticed the assistant has been worse this week. Can you tell us what changed?" Twelve more like it land before lunch. None of them will be answered by the existing incident-comms playbook, because that playbook was built for outages, and nothing has crashed.

This is the customer-facing AI postmortem problem, and it is the single most consistent gap I see across teams shipping LLM features into enterprise contracts. The reliability surface has shifted from "is it up" to "is it as good as it was last week," and almost none of the comms infrastructure has caught up. Status pages don't have a tile for it. Severity rubrics don't grade it. Support macros default to "we identified an issue and resolved it," which reads as either dismissive or alarming depending on the customer's mood that day.

The Demo Account Eval Set Your Sales Team Is Running Without You

· 10 min read
Tian Pan
Software Engineer

The most expensive eval set in your company isn't in your repo. It's in a slide deck a sales engineer assembled six months ago, plus three demo accounts named after your top-five logos, plus a half-remembered script that says "click here, ask the agent to summarize last quarter, watch the magic happen." It runs once or twice a week, in front of prospects worth six or seven figures. Nobody on the AI team has ever scored a run.

Then you ship a model migration on a Tuesday. On Thursday at 4 PM, the sales engineer pings the on-call channel: the summary output now starts with "Certainly! Here is a summary…" instead of jumping into the bullet points, the numbers are spelled out instead of digits, and the prospect — a Fortune 500 CFO who scheduled this meeting four weeks ago — just asked whether the product is always this chatty. The release notes called it a 1.2-percentage-point eval lift.
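One way to keep that demo script from drifting unscored is to turn its implicit expectations into a check that runs with every migration. The sketch below is illustrative only: the function name `demo_format_issues`, the opener list, and the number-word pattern are assumptions, not anything from the sales script, and the heuristics are deliberately crude.

```python
import re

# Hypothetical list of openers the demo script implicitly forbids.
CHATTY_OPENERS = ("certainly", "sure", "of course", "here is", "here's")

# Crude heuristic for spelled-out numbers; "one" is left out on purpose
# because it appears too often in ordinary prose.
SPELLED_NUMBER = re.compile(
    r"\b(two|three|four|five|six|seven|eight|nine|ten|eleven|twelve|"
    r"twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety|"
    r"hundred|thousand|million|billion)\b",
    re.IGNORECASE,
)


def demo_format_issues(summary: str) -> list[str]:
    """Return the format regressions found in one demo summary."""
    issues = []
    stripped = summary.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    # The demo expects the output to jump straight into the bullet points.
    if first_line.lower().startswith(CHATTY_OPENERS):
        issues.append("chatty_preamble")
    # The demo expects digits rather than spelled-out numbers.
    if SPELLED_NUMBER.search(summary):
        issues.append("spelled_out_number")
    return issues


good = "- Revenue grew 12% to $4.2M\n- Churn fell to 3%"
bad = "Certainly! Here is a summary of last quarter:\n- Revenue grew twelve percent"

print(demo_format_issues(good))
print(demo_format_issues(bad))
```

A check like this does not measure answer quality at all; it only encodes the presentation details a prospect would notice, which is exactly what an aggregate eval lift can miss.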

When Marketing Reads Your Eval Cases: The Cross-Functional Visibility Problem

· 11 min read
Tian Pan
Software Engineer

The eval set is the most-read artifact your AI team produces, and you almost certainly don't know who's reading it. The repo is private, the CI job is internal, the file is one directory above prompts/ — and yet a sales engineer scraped six cases for a demo last quarter, a marketing analyst pulled three failure cases into a "look how robust our system is" deck, customer success cited eval pass-rates verbatim in a renewal call, and product treats the file as the hidden spec the AI team won't share. The case files are read by more people than the code that generated them, and nobody on the AI team has noticed.

This isn't a permissions failure. The eval set is on the same Git server as the rest of the codebase, with the same access controls as every other engineering artifact. The problem is that the AI team is the only group that treats the eval set as code. Everyone else treats it as documentation, as marketing material, as a product spec, or as a customer complaint log — and each of those readings extracts a different slice of the same file, packages it for a different audience, and ships it somewhere the AI team isn't watching.

The First 90 Days for an AI Engineer: An Onboarding Playbook That Survives the Six-Week Doc Rot

· 12 min read
Tian Pan
Software Engineer

The new hire opens the onboarding doc. It points at a service architecture diagram from eleven months ago, a Confluence page titled "Our LLM Stack" last edited in October, and a Notion table of "model providers we use." Nothing in any of these documents tells them which prompt was tuned against which failure mode, which eval cases were added after which incident, which judge was recalibrated when the model bumped from 4.5 to 4.6, or why the system prompt for the support agent has a strange three-line preamble nobody wants to touch. Two weeks in, they ship a "small prompt cleanup" PR that removes the preamble. The eval suite passes. Production accuracy drops four points within a day.

The standard new-hire onboarding playbook — read the architecture doc, set up your laptop, do your first PR by week two — was built for engineers who join services. AI engineers join a different artifact. The thing they're going to be editing isn't a 5,000-line Go service that some staff engineer wrote; it's a 30-line prompt that survived eleven incidents and seventeen eval-driven rewrites, and the meaning of those thirty lines lives in the heads of two people on the team. Your onboarding doc cannot capture that, and trying to write a longer doc is the wrong fix.

GPU Capacity Is a Roadmap Constraint: The 18-Month Contract That Decided Q3

· 9 min read
Tian Pan
Software Engineer

Somewhere in your company, fourteen months ago, a finance director and a platform lead signed a multi-year accelerator commitment. They built a peak-load model from the prior quarter's telemetry, negotiated a discount of 40 to 70 percent off on-demand pricing, and locked in the cluster shape that your product roadmap now has to fit inside. Nobody on the product team was in the room. Nobody on the application engineering team saw the spreadsheet. The contract is binding, the discount only applies if the commitment is honored, and the capacity envelope it bought is now the silent ceiling on every Q3 feature your PMs are scoping.

The gap most teams don't notice until the second year: capacity contracts are roadmap decisions, but they're being made by people who don't see the roadmap, using inputs that don't include the roadmap. The product trio thinks it's choosing features from a clean priority backlog. Finance thinks it's optimizing a fixed envelope. Both are right inside their own frame, and the collision shows up in a planning meeting where an architect proposes a 70B-parameter model for the new assistant feature and the platform lead says, quietly, that the cluster is already at 85 percent and that model doesn't fit without crowding out something else.

The Internal Eval Set Is a Privacy Boundary Nobody Reviewed

· 11 min read
Tian Pan
Software Engineer

The dataset your AI team calls "the eval set" is, in most companies shipping LLM features, a collection of real customer conversations pulled from production logs. Nobody on the team thinks of it as a privacy event. The data never left the cluster. No new system was provisioned. No vendor was added. An engineer wrote a query, exported a few thousand traces into a labeling tool, and the team started grading model outputs against them. The legal team never heard about it because, from the inside, nothing changed — the same conversations that already lived in the same database were now also being read by a few engineers and a judge model.

That is the privacy boundary nobody reviewed. Customers gave you their messages so you could answer them. They did not give you their messages so you could measure your model against them. The two uses look identical at the storage layer and feel identical at the inference layer, but they are different processing purposes under every modern privacy regime — and the gap between the two is where the next round of compliance pain is going to land.

Locale-Stratified Evals: How to Catch Non-English Regressions Your English Test Set Can't See

· 12 min read
Tian Pan
Software Engineer

Your aggregate eval score is up 1.2 points after the last prompt change. Your CSAT on French queries dropped four points the same week. Both numbers are correct. They disagree because the eval set is 88% English and 6% Spanish, and the rest is a long tail of locales, none of which has enough cases to move the rollup. The French regression is in your data; it is just sitting three decimal places below the noise floor of your top-line metric.

This is the most common shape of locale drift I see in production AI systems: not a sudden collapse, not a translated-string bug, but a steady performance gap that the rollup hides and the support queue eventually surfaces. By the time someone in the Paris office forwards a screenshot, you have shipped two more prompt changes on top of the regression and the bisect costs three engineering days.
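The arithmetic behind a rollup that rises while one locale falls can be sketched in a few lines. Everything here is a hypothetical illustration: the 1,000-case set, the 2% French share, the per-locale pass counts, and the helper names `pass_rates`, `make_run`, and `regressed_locales` are assumptions chosen to reproduce the 88% / 6% split and the 1.2-point lift described above.

```python
from collections import defaultdict


def pass_rates(results):
    """results: iterable of (locale, passed) -> (overall rate, {locale: rate})."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for locale, passed in results:
        totals[locale] += 1
        passes[locale] += int(passed)
    overall = sum(passes.values()) / sum(totals.values())
    per_locale = {loc: passes[loc] / totals[loc] for loc in totals}
    return overall, per_locale


def make_run(counts):
    """counts: {locale: (n_cases, n_passed)} -> flat list of (locale, passed)."""
    run = []
    for locale, (n_cases, n_passed) in counts.items():
        run += [(locale, True)] * n_passed + [(locale, False)] * (n_cases - n_passed)
    return run


def regressed_locales(before, after, threshold=0.02):
    """Locales whose pass rate dropped by more than `threshold` between runs."""
    _, rates_before = pass_rates(before)
    _, rates_after = pass_rates(after)
    return sorted(
        loc
        for loc in rates_before
        if loc in rates_after and rates_before[loc] - rates_after[loc] > threshold
    )


# Hypothetical eval set: 88% English, 6% Spanish, 2% French, 4% other locales.
before = make_run({"en": (880, 704), "es": (60, 48), "fr": (20, 16), "other": (40, 32)})
after = make_run({"en": (880, 722), "es": (60, 48), "fr": (20, 10), "other": (40, 32)})

overall_before, _ = pass_rates(before)
overall_after, per_locale_after = pass_rates(after)

print(f"rollup: {overall_before:.3f} -> {overall_after:.3f}")
print(f"fr after: {per_locale_after['fr']:.2f}")
print("regressed:", regressed_locales(before, after))
```

In this made-up run the rollup moves from 0.800 to 0.812 while French falls from 0.80 to 0.50, so a gate on the per-locale deltas catches what the top-line number hides. With only twenty French cases the per-locale estimate is noisy, which is an argument for growing the small strata rather than for trusting the rollup.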