678 posts tagged with "ai-engineering"

Your AI Pricing Page Is a Leveraged Bet on Token Economics

· 9 min read
Tian Pan
Software Engineer

When the team published the AI tier at "$X per seat for unlimited AI," nobody on the pricing call thought of it as a derivative position. It looked like a SaaS pricing page — a number, a tier, a CTA. But every dollar of revenue from that page is now exposed to a token-cost curve set by a vendor whose roadmap does not care about your gross margin. You did not write a pricing page. You wrote a naked short on token volatility, and the strike is whatever your vendor charges next quarter.

The math arrives quickly. A handful of power users discover the workflow and start running it on the longest context they can fit. A competitor's UX change re-trains the median user to send queries that are 40% longer. The frontier model your feature is locked to gets a price-per-million bump because the older tier you were on is being deprecated. Any one of these is a margin event you cannot reverse from the pricing page in a single quarter — and they tend to arrive together.
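A minimal sketch of how those margin events stack on a flat-priced seat. Every number here (the $30 seat price, 2M tokens per seat per month, the $5 and $7.50 per-million prices) is a hypothetical chosen for illustration, and the sketch assumes token volume grows in proportion to query length.

```python
def seat_margin(seat_price_usd: float,
                tokens_per_month: float,
                usd_per_million_tokens: float) -> float:
    """Gross margin fraction for one flat-priced seat (inference cost only)."""
    inference_cost = tokens_per_month / 1_000_000 * usd_per_million_tokens
    return (seat_price_usd - inference_cost) / seat_price_usd


# Hypothetical baseline: $30 seat, 2M tokens per month, $5 per million tokens.
baseline = seat_margin(30.0, 2_000_000, 5.0)

# Queries get 40% longer (assumed to mean 40% more tokens), price unchanged.
longer_queries = seat_margin(30.0, 2_000_000 * 1.4, 5.0)

# Longer queries plus a hypothetical price bump from $5 to $7.50 per million.
both_events = seat_margin(30.0, 2_000_000 * 1.4, 7.5)

for name, margin in [("baseline", baseline),
                     ("longer queries", longer_queries),
                     ("longer queries + price bump", both_events)]:
    print(f"{name}: {margin:.0%} gross margin")
```

The seat price never moves in this sketch, which is the point: both inputs that changed are outside the pricing page's control.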

The AI Risk Register: What Your CRO Will Demand the Morning After

· 12 min read
Tian Pan
Software Engineer

The morning after the first six-figure agent incident, the directors will not ask whether the model was state-of-the-art. They will ask to see the row in the risk register that named this scenario, the owner who signed off, and the date the board last reviewed it. If your enterprise risk register has lines for cyber, vendor, regulatory, and operational risk, but no row for "an autonomous agent took an action under our credentials that produced a customer-visible loss," you are about to spend a board meeting explaining why the artifact every other category of risk merits did not exist for the one that just lost you money.

This is not a hypothetical anymore. Gartner projects that more than a thousand legal claims for harm caused by AI agents will be filed against enterprises by the end of 2026. AI-related risk has moved from tenth to second on the Allianz Risk Barometer in a single year. Insurers are now asking, in D&O renewal questionnaires, how the board has integrated AI into the corporate risk register and how third-party agentic exposures are being tracked. The line items below are what a defensible answer looks like, along with the cadence on which the AI feature owner has to defend them.

The 'Try a Bigger Model' Reflex Is a Refactor Smell

· 10 min read
Tian Pan
Software Engineer

A regression lands in standup: the support agent answered three customer questions wrong overnight. Someone says, "let's try Opus on this route and see if it fixes it." Forty minutes later the eval pass rate ticks back up, the team closes the ticket, and the inference bill quietly tripled on that path. Six weeks later the same shape of regression appears on a different route, and the same fix is applied. Your team has just trained a Pavlovian reflex: quality regression → escalate compute. The bigger model is the most expensive debugging tool in your stack, and you're now reaching for it first.

The trouble isn't that bigger models don't help. They do — sometimes a lot. The trouble is that bigger models are a strictly dominant masking strategy. When the prompt has a conflicting instruction, the retrieval is returning stale chunks, the tool description is being misread, or the eval set doesn't cover the failing distribution, a more capable model will round the corner of the failure without fixing any of those things. The next regression has the same root cause, the bill has compounded, and the underlying system is more brittle, not less, because the slack created by the upgrade kept anyone from looking under the hood.

Browser-Native AI Is a Per-Feature Decision: Four Axes Your Team Hasn't Priced

· 12 min read
Tian Pan
Software Engineer

The model-in-the-tab story used to be easy to dismiss: small models, novelty demos, a cute Whisper transcription that ran for thirty seconds before the laptop fan turned on. That story is dead. Quantization improved, WebGPU shipped in every major browser, on-device caches got a persistent quota, and 4-bit 3B models now stream tokens at a rate users perceive as "snappy" on a $500 laptop. Server-side inference is no longer a safe default. Where a feature runs is a load-bearing architectural decision, and your product team is making it by accident every time they accept the platform team's first answer.

The mistake that follows is bigger than the demo getting worse. Teams pick one backend — usually server inference, sometimes browser inference — for the entire product, and then pay the wrong tax on every feature that doesn't fit. The privacy-sensitive feature loses to the latency-sensitive one because the architecture forces a single answer. Or worse, the team picks browser-native because the demo was magical, then ships a fleet experience where 30% of users on the long-tail device population get a degraded product the dashboard can't see.

Browser-native AI is not a faster TensorFlow.js. It is a different runtime with a different SRE story, a different cost model, and a four-axis trade-off that does not collapse into a single answer. Treating it as "the cheap version of the API call" is the architectural mistake of 2026.

Cost-Per-Correctness, Not Cost-Per-Token: The Unit Metric Your Bill Won't Tell You

· 11 min read
Tian Pan
Software Engineer

A team I know cut their inference bill 40% in a single quarter by migrating their support-email triage flow from a frontier model to a mid-tier one. The CFO sent a thank-you note. Six months later, customer support headcount was up two FTEs and average resolution time had risen 35%. Nobody connected the dots, because the dots lived in different dashboards: the inference bill on the platform team's, the support load on the operations team's. The migration looked like a win on the only metric anyone was tracking. The metric was wrong.

This is the cost-per-token trap. Your invoice tells you what you spent on tokens. It cannot tell you what you spent per correct task, because the inference vendor has no idea what "correct" means in your domain. They sold you raw compute. You bought outcomes — or thought you did. The gap between those two units is where AI unit economics quietly comes apart, and the team that doesn't measure the right denominator is running half the equation and shipping the other half blind.
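One way to put the two denominators side by side is sketched below. The function name, the task counts, and the $4 rework cost per failed task are all hypothetical; the only figure taken from the story above is the 40% cut in inference spend.

```python
def cost_per_correct(inference_usd: float,
                     tasks: int,
                     correct: int,
                     rework_usd_per_failure: float = 0.0) -> float:
    """Total cost (inference plus downstream rework) per correctly handled task."""
    if correct <= 0:
        raise ValueError("no correct tasks: cost per correct task is undefined")
    failures = tasks - correct
    total_usd = inference_usd + failures * rework_usd_per_failure
    return total_usd / correct


# Hypothetical month of 10,000 triage tasks.
# Frontier model: $1,000 inference, 9,500 correct.
# Mid-tier model: $600 inference (the 40% cut), 8,500 correct.
frontier_bill_only = cost_per_correct(1000.0, 10_000, 9_500)
midtier_bill_only = cost_per_correct(600.0, 10_000, 8_500)

# Same comparison once each failure costs a hypothetical $4 of human handling.
frontier_full = cost_per_correct(1000.0, 10_000, 9_500, rework_usd_per_failure=4.0)
midtier_full = cost_per_correct(600.0, 10_000, 8_500, rework_usd_per_failure=4.0)

print(f"bill only:   frontier ${frontier_bill_only:.3f}  mid-tier ${midtier_bill_only:.3f}")
print(f"with rework: frontier ${frontier_full:.3f}  mid-tier ${midtier_full:.3f}")
```

Under these assumed numbers the mid-tier model wins on the invoice and loses once rework is counted. The rework cost is the input that usually lives in someone else's dashboard.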

Cross-Team Agent SLAs Don't Compose: The 99% Math Your Org Forgot to Budget

· 11 min read
Tian Pan
Software Engineer

Team A's agent advertises a 99% success rate. Team B's agent advertises 99%. The new joint workflow that calls both lands at 98% on a good day, 96% on a bad one — and the team that owns the joint workflow is now the de facto SRE for two systems they don't own, can't reproduce locally, and didn't write the eval set for. Each upstream team is hitting its SLO. The composite product is missing its SLO. Nobody's pager is ringing on the right side of the boundary.

This is the math of independent failure rates, and it has been hiding in plain sight ever since the org started letting agents call each other. Success probabilities multiply: five components at 99% reliability give you about 95% end-to-end, and ten give you about 90%. A 20-step process at 95% per step succeeds only 36% of the time, so nearly two-thirds of runs fail before completion. By the time a workflow chains 50 components (not unusual once an enterprise agent starts calling sub-agents that call tool agents), a system where every individual piece is "99% reliable" will fail roughly four out of ten requests.
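The figures above follow from multiplying per-step success probabilities. A short sketch, assuming every step fails independently and that any single failure fails the whole workflow (the function name is illustrative):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent failures."""
    return per_step ** steps


# The chains mentioned in the text: (per-step success rate, number of steps).
chains = [(0.99, 2), (0.99, 5), (0.99, 10), (0.95, 20), (0.99, 50)]

for per_step, steps in chains:
    success = end_to_end_success(per_step, steps)
    print(f"{steps:>2} steps at {per_step:.0%} each -> "
          f"{success:.1%} succeed, {1 - success:.1%} fail")
```

Real failures are often correlated (a shared upstream outage takes out several steps at once), so treat the independent-failure figure as a budgeting baseline, not a prediction.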

Researchers analyzing five popular multi-agent frameworks across more than 150 tasks identified failure rates between 41% and 87%, with the top three failures being step repetition, reasoning–action mismatch, and unawareness of termination conditions — and unstructured multi-agent networks have been observed to amplify errors up to 17× compared to single-agent baselines. The math isn't subtle. The problem is that the org's SLO sheets, dashboards, on-call rotations, and PRDs are still scoped one agent at a time.

The Eval Bottleneck: Your Eval Engineer Is Now the Roadmap

· 11 min read
Tian Pan
Software Engineer

The constraint on your AI roadmap isn't GPU capacity, model availability, or prompt-engineering taste. It's the calendar of one or two engineers who actually know how to build an eval that catches a regression. Every PM with a feature is in their queue. Every model upgrade is in their queue. Every cohort drift, every prompt revision, every "is this judge still calibrated" question lands in the same inbox. And the engineer in question said "no, this isn't ready" three times this quarter, got overruled twice, watched the regression compound in production, and is now updating their LinkedIn.

This is the eval bottleneck, and most orgs don't see it until it bites. Through 2025 the visible scaling story was AI engineers — hire AI engineers, ship AI features, iterate on prompts, swap models. By Q1 2026 the throughput problem moved one layer down. The team that doubled its AI headcount discovered that adding more feature engineers didn't make features ship faster, because every feature still needed an eval, and the eval engineer was the same person.

Eval Differential as Branch Protection: Ship Score Diffs, Not Score Floors

· 10 min read
Tian Pan
Software Engineer

A team I worked with had a clean-looking eval gate: every prompt PR had to score above 0.85 on the golden set or the merge button stayed grey. They were proud of it. Six weeks in, average quality had quietly drifted from 0.93 to 0.87 — every PR cleared the bar, every PR landed, and no individual change owned the regression because none of them broke the rule. The bar was set against a snapshot of last quarter's quality, not against last week's.

That's the failure mode of an absolute-threshold eval gate: a PR that drops the score from 0.92 to 0.86 ships green, while a PR that lifts the score from 0.80 to 0.84 fails the same gate. The team learns "ship if it clears the bar" — a quality story. The signal you actually want is "ship if this change is non-regressive on the slices that matter" — a regression-detector story.

Coverage tools figured this out a decade ago. They report the diff against the parent commit and they break it down per file. Eval gates haven't caught up.
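A minimal sketch of the two gate styles, using the scores from the example above. The function names, the 0.01 tolerance, and the slice names are hypothetical; this is one possible shape for a differential gate, not a description of any particular tool.

```python
def absolute_gate(score: float, floor: float = 0.85) -> bool:
    """Score-floor gate: passes whenever the candidate clears the bar."""
    return score >= floor


def regressed_slices(baseline: dict[str, float],
                     candidate: dict[str, float],
                     tolerance: float = 0.01) -> list[str]:
    """Slices where the candidate scores worse than the parent commit
    by more than `tolerance`. A slice missing from the candidate counts as 0."""
    return sorted(
        name for name, base_score in baseline.items()
        if base_score - candidate.get(name, 0.0) > tolerance
    )


def differential_gate(baseline: dict[str, float],
                      candidate: dict[str, float],
                      tolerance: float = 0.01) -> bool:
    """Score-diff gate: passes only if no slice regressed beyond tolerance."""
    return not regressed_slices(baseline, candidate, tolerance)


# PR A drops 0.92 -> 0.86: green on the floor, red on the diff.
pr_a_floor = absolute_gate(0.86)
pr_a_diff = differential_gate({"overall": 0.92}, {"overall": 0.86})

# PR B lifts 0.80 -> 0.84: red on the floor, green on the diff.
pr_b_floor = absolute_gate(0.84)
pr_b_diff = differential_gate({"overall": 0.80}, {"overall": 0.84})

# Per-slice view with hypothetical slices: one improves, one regresses.
flagged = regressed_slices({"billing": 0.90, "refunds": 0.95},
                           {"billing": 0.93, "refunds": 0.88})
print(pr_a_floor, pr_a_diff, pr_b_floor, pr_b_diff, flagged)
```

In practice the tolerance would need to account for run-to-run noise in the eval itself, which this sketch ignores.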

The Eval-Set Poison Pill: When Your Benchmark Becomes a Backdoor

· 10 min read
Tian Pan
Software Engineer

A team I know spent six months chasing a regression that wasn't there. Every release passed the eval. Every release shipped. Every quarter, NPS on the AI-served cohort drifted down a point. Eventually, an intern doing a routine audit of the gold dataset noticed that one labeler — long since rotated off the contract — had graded 11% of the items, and that those items were systematically more lenient on a specific failure mode the team had been racing to fix. The eval said the model was getting better. The model was not getting better. The eval had been quietly tilted by one human's calibration drift, and nobody had been watching the labelers because nobody believed the labelers were a threat surface.

This is the eval-set poison pill. Most teams treat their eval set as a trusted artifact: the labels were graded by humans, the data came from production, and the regression dashboard is the one thing the org agrees to defer to when shipping. But the labeling pipeline is a human supply chain, and human supply chains are gameable. Treating an eval as ground truth without applying supply-chain hygiene to its inputs is trusting a number whose provenance you cannot defend.
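As one example of that hygiene, a per-labeler audit can be sketched in a few lines. The data layout (pairs of labeler ID and pass/fail grade), the labeler names, and the 0.15 gap threshold are assumptions for illustration. A gap in pass rates can also come from which items a labeler was assigned, so a flag here is a prompt to re-grade a sample, not a verdict.

```python
from collections import Counter


def lenient_labelers(labels: list[tuple[str, bool]],
                     max_gap: float = 0.15) -> dict[str, tuple[float, float]]:
    """Flag labelers whose pass rate exceeds everyone else's by more than max_gap.

    Returns {labeler: (share_of_items, pass_rate_gap)} for flagged labelers.
    """
    graded = Counter(labeler for labeler, _ in labels)
    passed = Counter(labeler for labeler, ok in labels if ok)
    total_graded = len(labels)
    total_passed = sum(passed.values())

    flagged = {}
    for labeler, n in graded.items():
        others_graded = total_graded - n
        if others_graded == 0:
            continue  # a single labeler has nobody to be compared against
        own_rate = passed[labeler] / n
        others_rate = (total_passed - passed[labeler]) / others_graded
        gap = own_rate - others_rate
        if gap > max_gap:
            flagged[labeler] = (n / total_graded, gap)
    return flagged


# Hypothetical gold set of 100 items: labeler "a" graded 11 and passed 10.
labels = (
    [("a", True)] * 10 + [("a", False)] * 1
    + [("b", True)] * 27 + [("b", False)] * 18
    + [("c", True)] * 26 + [("c", False)] * 18
)
audit = lenient_labelers(labels)
for labeler, (share, gap) in audit.items():
    print(f"{labeler}: graded {share:.0%} of items, pass rate {gap:+.0%} vs. others")
```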

Your Gold Eval Set Has Drifted and Its Pass Rate Is the Reason You Can't See It

· 12 min read
Tian Pan
Software Engineer

The gold eval set passes at 94%. The model has been bumped twice this quarter, the prompt has been edited eleven times, the tool catalog has grown by four, and the dashboard is still green. Then a sales engineer forwards a transcript where the agent confidently routes a customer to a workflow that was sunset two months ago, and the head of support quietly opens a thread asking why the satisfaction scores have been sliding for six weeks while the eval pipeline reports no regressions. The gold set isn't lying. It's measuring last quarter's product against this quarter's traffic, and nobody asked it to do anything else.

This is the failure mode evaluation systems make hardest to see, because the instrument that's supposed to detect quality regressions is itself the source of the false negative. Pass rate is computed against the items in the set; the items in the set were curated against a snapshot of usage; usage moved on; the rate stayed clean. The team trusts the green dashboard, ships another model upgrade, and discovers months later that the eval set has been measuring something different from what production traffic demands for longer than anyone wants to admit.

The fix is not to refresh the gold set more often. Refresh cadence is the wrong knob; the right knob is having a second instrument calibrated to a different time window so disagreement between the two surfaces drift before users do. That second instrument is the shadow eval — a parallel set rebuilt continuously from current production traffic, run alongside the gold set, with the explicit job of disagreeing with it.
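The disagreement check between the two instruments can be sketched as follows. The function names, the 0.05 alert threshold, and the shadow-set counts are hypothetical; only the 94% gold pass rate comes from the scenario above.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of eval items that passed."""
    if total <= 0:
        raise ValueError("eval set is empty")
    return passed / total


def drift_gap(gold_passed: int, gold_total: int,
              shadow_passed: int, shadow_total: int) -> float:
    """How much better the model looks on the gold set than on the shadow set."""
    return pass_rate(gold_passed, gold_total) - pass_rate(shadow_passed, shadow_total)


def drift_alert(gap: float, threshold: float = 0.05) -> bool:
    """Alert when gold outperforms shadow by more than the (assumed) threshold."""
    return gap > threshold


# Gold set still passes at 94%; a hypothetical shadow set built from
# recent production traffic passes at 81%.
gap = drift_gap(gold_passed=94, gold_total=100,
                shadow_passed=81, shadow_total=100)
alert = drift_alert(gap)
print(f"gold minus shadow: {gap:+.0%}, alert: {alert}")
```

With small shadow sets the gap will be noisy, so a real check would likely want a confidence interval or a minimum sample size before alerting.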

The Human Attention Budget Is the Constraint Your HITL System Silently Overspends

· 10 min read
Tian Pan
Software Engineer

The 50th decision your reviewer makes this morning is not the same quality as the first. The architecture diagram does not show this. The capacity model does not show this. The dashboard tracking "approvals per hour" actively hides it. And yet the entire premise of your human-in-the-loop system — that a person catches what the model gets wrong — is silently degrading from the moment the queue begins to fill.

Most HITL designs treat reviewer time as an infinite, fungible resource. The team sets a confidence threshold, routes everything below it to a human queue, and declares the system "safe." Six weeks later, the approval rate has crept up to 96%, the queue is twice as deep as the staffing model assumed, and a sample audit shows that reviewers are clicking "approve" on edge cases they would have flagged on day one. The system has not failed. It has rubber-stamped its way into looking like it is working.

The Idle Agent Tax: What Your AI Session Costs While the User Is in a Meeting

· 11 min read
Tian Pan
Software Engineer

A developer opens their IDE copilot at 9:00, asks it three questions before standup, and then sits in meetings until 11:30. The chat panel is still open. The conversation is still scrollable. The model hasn't generated a token in two and a half hours. And yet that session — sitting there, attended by nobody — has been quietly accruing cost the entire morning. KV cache pinned. Prompt cache being kept warm by a periodic ping. Conversation state held in a hot store. Trace pipeline writing one row per heartbeat. Concurrency slot reserved on the model provider. Multiply by ten thousand seats and the bill is real.

This is the idle agent tax. It is the part of your inference budget that pays for capacity your users are not using, and it is invisible to most engineering dashboards because the dashboards were built for stateless APIs. A request comes in, a response goes out, the box closes. Done. Agentic products broke that model two years ago and most teams have not yet repriced their architecture around it.