Skip to main content

678 posts tagged with "ai-engineering"

View all tags

The Onboarding Gap: Why New Engineers Take Three Months to Touch the AI Stack

· 9 min read
Tian Pan
Software Engineer

A backend engineer with eight years of experience joins your team. By week three on a normal codebase, they would be shipping features. On the AI surface, they are still asking questions in DMs, and you can predict which two senior engineers they are asking. Three months in, they are finally trusted to edit the system prompt — not because the prompt is hard, but because nobody could tell them which evals would catch a regression and which would happily wave bad output through.

This is not a hiring problem or a documentation problem in the usual sense. AI codebases carry a hidden domain-knowledge tax that does not show up in code review, does not appear in the README, and is invisible to the static analyzer. The tax is paid in onboarding time, in repeated questions to the same two people, and eventually in a team that quietly bifurcates into "the people who can touch it" and "everyone else."

The AI Wallet: Why Token Budgets Belong in the UI, Not the Engineering Dashboard

· 10 min read
Tian Pan
Software Engineer

Pull up the per-user cost dashboard for any AI product on a flat subscription. The shape is always the same. A long, flat tail of users who barely move the needle, and a thin spike at the top where five percent of accounts burn eighty percent of the inference budget. The spike is hidden from users on both ends. The power users don't know they're subsidizing nothing — they assume the price is the price. The casual users don't know they could ask for more — they assume the limit is the limit.

The dashboard stays engineering-internal because product is afraid that exposing it will scare users. It does the opposite. The team that hides cost ends up shipping silent throttling, hidden model downgrades, and answer truncation that the user reads as "this product is broken." The team that exposes cost — as a deliberate UI surface, not an admin page — turns the same cost ceiling from a churn driver into a monetization lever.

This is the AI wallet. Not a billing page. A product primitive.

Compliance Reviewer as Eval Author: Why Legal Should Be Writing Your Test Cases

· 13 min read
Tian Pan
Software Engineer

The most useful adversarial prompt I have seen for an enterprise LLM did not come from a red team, a security researcher, or a prompt engineer. It came from a senior compliance attorney who asked the model, in plain English, to "tell me which of the three retirement annuities discussed earlier in this thread is the best one for a 62-year-old approaching their first required minimum distribution." The model produced a confident, thoughtful, beautifully-formatted recommendation. That output, had it been sent to a customer, would have been a textbook FINRA suitability violation — an unsuitable individualized recommendation made without the supervisory infrastructure that securities rules require around personalized advice.

The compliance attorney spotted the failure mode in about four seconds. The engineering eval suite, which had a hundred-plus carefully constructed cases for hallucination, refusal calibration, and tool-use accuracy, had no concept that this particular response shape was illegal. Not low quality. Not a hallucination. Illegal. And the workflow at the company at the time had her reading sample outputs in a Google Doc and writing memos, rather than checking a test case into the regression suite. So her catch lived in a memo, the memo got summarized in a launch-readiness slide, and the next month a refactor of the system prompt regressed the behavior because nobody had a failing test pinned to it.

That is the gap I want to argue we should close: the compliance reviewer should be authoring eval cases directly, and those cases should be the artifact that gates release — not the document review that produced them.

Conversational REST: When Your Chat UI Needs Pagination, Filters, and Sort

· 11 min read
Tian Pan
Software Engineer

A user asks your shopping agent for "running shoes under $150 with good arch support." The model dutifully returns twelve options as a wall of bulleted text inside a single chat bubble that overflows the viewport. The user scrolls, loses their place, and types "show me only Asics" — at which point your agent re-runs the entire search instead of filtering the result set it already has. Three turns later, the user is inventing a query language one prompt at a time, and your product feels like a command line wearing a chat-bubble costume.

This is the failure mode I keep watching teams ship. They built a chat product on top of what users actually wanted to be a faceted-search product. The model is fine. The retrieval is fine. The UI is the problem, and it's the wrong shape for the task.

The shortest way I can put it: chat is an input modality, not an output one. The agent's job is to translate user intent into a structured query. The moment the result set is more than three items, the right answer is to render UI, not to keep talking.

The Cost of Reversal: Why Pulling Back an AI Feature Is Harder Than Shipping One

· 10 min read
Tian Pan
Software Engineer

The release process you have was designed for a world where shipping is irreversible and rollback is free. AI flips that. Once a feature has been live for a quarter, the disruption cost of pulling it back exceeds the disruption cost of launching it — and the louder customer feedback you will ever get on that feature is the day you take it away, not the day it shipped.

The team builds a kill switch for every AI launch. Nobody ever pulls it. Not because the feature is flawless, but because by the time anyone wants to, the cost of doing so has compounded past anything the launch criteria considered. Feature flags assume the world is symmetric: the system before the flip and the system after the flip are equally valid resting points, and you can move between them as you please. AI features break that assumption silently, and the team's release process — built around reversible flags — quietly assumes the asymmetry away.

The first time the team notices is when somebody proposes deprecating the feature.

Cost-Per-Conversation as a Product Contract: When Pricing Drives Architecture

· 10 min read
Tian Pan
Software Engineer

The cleanest way to find out your AI feature's pricing model is wrong is to look at which engineer is currently rewriting the truncation logic at midnight. They aren't shipping a capability — they're patching a unit-economics leak that the PRD never named, and the patch is necessarily user-hostile because the product spec told them the budget was infinite. On a flat-fee SaaS plan, every conversation that runs longer than the median pulls margin out of the company in real time. The only real question is whether the product team admits it before finance does.

Traditional SaaS economics rest on near-zero marginal cost per user: once the software is built, serving the next customer barely moves the infrastructure line. AI features break that assumption. Every turn in a conversation consumes inference compute that scales with prompt size, output length, tool-call fan-out, and retrieval volume — and conversations don't have a natural stopping point. A heavy user can consume 50× the median in a billing period without leaving the happy path of the product. Under flat pricing, that user is funded by the rest of the user base, and the company finds out only when COGS reporting catches up a quarter later.

This is why pricing on AI features is not a finance problem to be handled after launch. It is an architecture input that decides what the product is allowed to do, and refusing to make it visible in the spec just means it gets resolved later, in worse ways, by people without product authority.

Your Eval Suite Is the Product Spec You Refused to Write

· 10 min read
Tian Pan
Software Engineer

Open the PRD for any AI feature shipping this quarter. Notice the adjectives. The assistant should be helpful. Responses should feel natural. The agent should understand the user's intent. The summary should be accurate and concise. Every one of these words is a place the team gave up. They did not decide what the feature does. They decided how they would describe the feature to each other in a meeting, then handed the actual product definition — quietly, without anyone calling it that — to whoever wrote the eval suite.

This is not a documentation problem. The eval is the spec. The PRD is a press release written before the product exists. The fuzzy adjectives in the doc become unambiguous behavioral assertions in the eval, or they become nothing — the model picks an interpretation, ships it, and the team discovers a quarter later that "concise" meant something different to the reviewer than to the user, and different again to whoever tuned the prompt last sprint. An AI feature whose eval suite is thin is a feature whose product definition is thin. The model didn't fail. The team never decided what success meant.

Forced Conformance Bias: When the Model Rounds Your Intent to the Distribution Mode

· 10 min read
Tian Pan
Software Engineer

A user asks for "a haiku about Postgres replication." The model returns a five-line poem about databases that mentions servers and synchronization, sounds confident, scans like English, and is not a haiku. A different user asks for "a regex that matches IPv6 addresses but explicitly rejects IPv4-mapped forms." The model returns a regex that matches IPv6 addresses, including the IPv4-mapped forms it was told to reject, and asserts in prose that the regex meets the spec. A third user asks for "an explanation of monads using only cooking metaphors, no mention of functions or types." The model gives a mostly-cooking explanation that uses the words "function" twice and "type" three times.

None of these is a refusal. None is an obvious hallucination. The model didn't say "I can't do that." It produced a confident, well-formed response that quietly relaxed the part of the request furthest from its training distribution mode, and the user has to be paying close attention to notice. The failure mode has a name worth using: forced conformance bias — the model rounds your intent toward the typical answer, the user reads the result as a faithful response, and the eval suite that should have caught it was itself drawn from typical phrasings.

This is not a model quality problem in the usual sense. The model is doing exactly what its training pushed it toward. It is a product reliability problem, and the team whose evals live at the mode of intent distribution is calibrating against the easy half of their actual workload.

The Frozen Prompt: When Your Team Is Afraid to Edit a System Prompt That Works

· 13 min read
Tian Pan
Software Engineer

Every mature AI product eventually grows a system prompt that nobody on the current team fully understands. It started as forty tokens of plain English, and twenty months later it is a 4,000-token wall of conditional clauses, refusal templates, formatting rules, persona reinforcements, edge-case warnings, and one peculiar sentence about Tuesdays that nobody can explain. Each line was added in response to a specific failure: a customer complaint, a Slack ping from legal, a regression caught by an eval, a one-off bug that surfaced during an investor demo. The engineer who wrote line 37 has rotated to another team. The engineer who wrote line 112 was a contractor whose Notion doc was archived. The eval suite covers maybe a third of the behaviors the prompt is asserting, and nobody is sure which third.

So the prompt becomes load-bearing in the worst possible way: it works, the team knows it works, and the team has stopped touching it. Engineers who should be iterating on the prompt route their changes around it instead — adding a post-processing filter here, a few-shot wrapper there, a parallel "v2 prompt" feature-flagged off in case anyone ever finds the courage to A/B test the replacement. The prompt has stopped being software and has become a relic. And once that happens, the prompt is no longer the lever you use to improve the product. It's the constraint shaping it.

The Internal-Tooling Agent: When Your Highest-Leverage AI Feature Has Zero Customers

· 10 min read
Tian Pan
Software Engineer

The most strategic AI investment in your company is probably a Slack bot one engineer built on a Friday afternoon. It answers "how do I get a staging credential" or "which on-call is responsible for the auth service" or "what's the runbook for a stuck deploy," and it has saved more engineering hours than the entire customer-facing AI roadmap that absorbs three quarters of your model spend, your safety review queue, and your launch comm bandwidth.

The org chart doesn't reflect this. The OKR doc doesn't reflect this. Nobody is the PM. Nobody is the EM. The bot survives because the engineer who built it still answers the GitHub issues, and the value compounds quietly while every customer-facing feature ships behind a six-week safety review and a launch readiness checklist that exists because the customer might churn.

Negative Prompts Are Code Smells: Why Every 'Don't' in Your System Prompt Is Technical Debt

· 10 min read
Tian Pan
Software Engineer

Open the system prompt of any production AI feature that has been live for more than three months. Count the negative clauses — the "do not," "never say," "avoid," "under no circumstances," "you must not." If the count is in the double digits, you are not looking at a system prompt. You are looking at a graveyard. Each tombstone marks a specific user complaint, a specific incident report, a specific Slack message from a stakeholder who saw the model do something embarrassing. The team patched the surface and moved on, and now the prompt reads like a legal disclaimer with a personality grafted onto the front.

Negative prompts are code smells. Not in the metaphorical sense — in the literal one. They are the prompt-engineering equivalent of a try/except block that swallows an exception, a config flag with no documentation, a // TODO: refactor this from 2022. They work, kind of, until they don't. And the failure mode they hide is almost always more interesting than the failure they were added to suppress.

The Phantom Skill: When Your Agent Demonstrates Capabilities You Never Tested For

· 11 min read
Tian Pan
Software Engineer

A customer posts a screenshot in your support channel. They've been using your scheduling agent to negotiate three-way meeting times across timezones in mixed English and Japanese, with the agent producing suggested slots in both languages and reasoning about Japanese business etiquette. It works. Leadership shares it on Slack with a fire emoji. The PM updates the marketing copy.

Nobody on the team wrote that capability. No eval covers it. No prompt instruction mentions Japanese, etiquette, or three-way coordination. The behavior is real, but it was never engineered, never measured, and is now in your product surface area.

This is a phantom skill: a capability your agent demonstrates that no test ever verified. It isn't a bug. It isn't quite a feature either. It's load-bearing behavior with no contract, and it's the failure mode that quietly defines what your "AI product" actually is.