861 posts tagged with "insider"

The PR-Bot That Never Sleeps: When Your Reviewers Become the Rate Limiter

· 11 min read
Tian Pan
Software Engineer

For two decades the bottleneck in software engineering was writing code. We optimized IDEs, autocompletion, refactoring tools, and frameworks to make typing cheaper. We won. Now the bottleneck has moved one step downstream: writing is cheap, and reading is expensive. The PR-bot can spin up ten implementation attempts in parallel and open ten pull requests against your repo before you finish your morning coffee. Your reviewers cannot.

The rate limiter for AI-assisted software delivery is no longer the model's tokens per second. It is the number of human eyes you can put on a diff per day. And when those eyes get overwhelmed, you do not get a graceful degradation — you get rubber stamps. Code merges with LGTM 🚀 on top of code that nobody actually read. A senior engineer approves an AI-written patch that another AI tool already reviewed, and three weeks later a data-inconsistency bug eats forty hours of someone's life. Surface correctness is not systemic correctness, and a green pipeline is not understanding.

The PR Description Your Coding Agent Cannot Write

· 10 min read
Tian Pan
Software Engineer

Your coding agent finished the task. The diff is small, the tests are green, the lint is clean, and the PR body says, in its entirety, "Fixes the bug in module X." A reviewer six time zones away opens the page, reads the diff in isolation, sees nothing wrong with it, and approves a technically correct change that solves the wrong problem. The change ships. Two days later a customer asks why the workaround they had been relying on stopped working, and you discover that the bug your agent fixed was not the bug the ticket was about.

The code was fine. The reviewer was conscientious. The agent did exactly what it was asked. The artifact between them — the pull request — was empty of everything that would have caught the mistake.

The Redaction Layer Your Agent Cannot Reason Through

· 9 min read
Tian Pan
Software Engineer

A privacy review approves your redaction layer. Names, emails, account numbers, phone numbers — all scrubbed before the prompt reaches the model. Your single-turn classifier still hits 94% accuracy. Six weeks later your multi-step agent starts giving confidently wrong answers to questions like "is the email Sarah used to log in the same as the one on her billing record?" and nobody can reproduce it in dev.

The redaction layer did exactly what infosec asked it to do. It also quietly destroyed the property your agent's reasoning depended on: that two mentions of the same entity in different turns refer to the same thing. The agent isn't hallucinating. It's reading a transcript where Sarah has become three different people and the "same" email address has become two distinct placeholders.
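One way to keep that property is to make placeholders stable within a session, so the same value always maps to the same token. The sketch below is a minimal illustration, not the layer described above: `SessionRedactor`, the `<EMAIL_n>` token format, and the simplified email regex are all assumptions made for the example.

```python
import re

# Deliberately simple email pattern for illustration; a real layer would
# cover more entity types and edge cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class SessionRedactor:
    """Hypothetical redactor whose placeholders stay stable for one session."""

    def __init__(self):
        # Maps a normalized raw value to the placeholder it was first given.
        self._placeholders = {}

    def _placeholder_for(self, value: str) -> str:
        key = value.lower()  # treat case variants as the same entity
        if key not in self._placeholders:
            self._placeholders[key] = f"<EMAIL_{len(self._placeholders) + 1}>"
        return self._placeholders[key]

    def redact(self, text: str) -> str:
        # Every occurrence of the same address gets the same token,
        # across turns, as long as the same redactor instance is reused.
        return EMAIL_RE.sub(lambda m: self._placeholder_for(m.group(0)), text)
```

With one redactor per session, the agent can still answer "is this the same email?" by comparing tokens, while the raw address never reaches the model.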

The Typo Your Agent Learned to Honor

· 10 min read
Tian Pan
Software Engineer

An insurance carrier fine-tuned a support model on a year of chat transcripts. Within a week of launch, a compliance reviewer flagged something odd: the bot kept writing "deductable" instead of "deductible." Not occasionally but consistently, in roughly one in eight of the messages where the word appeared. The model had not invented the misspelling. It had inherited it. A handful of tier-1 reps had been typing it that way for two years, and the corpus reflected what they typed, not what the dictionary said.

This is the unsettling thing about supervised fine-tuning on operational data: the model is not learning your domain. It is learning your corpus. Those two things overlap, but they are not the same, and the gap is where every preventable behavioral defect lives. Frequency in your training data is not a signal of correctness. It is a signal of what your team happened to do enough times for the model to mimic it.

The misspelling is the easy case to spot. The hard cases are the ones nobody bothered to write down as rules, because everyone assumed the model would learn the "professional" version of the work rather than the actual work as performed.

The Verifier Loop That Couldn't Converge

· 11 min read
Tian Pan
Software Engineer

The most expensive bug in an agent system is the one with no error message. Worker proposes a draft. Verifier rejects it with a paragraph of feedback. Worker revises. Verifier rejects again. The loop keeps spinning, the trace keeps growing, the bill keeps climbing, and from the outside the system looks like it is working — diligently, in fact, because both models are doing their assigned job. What nobody priced in is that the verifier's acceptance criteria are not fixed across calls. The target the worker is chasing is moving, and the loop has no convergence guarantee.

You shipped "iterate until satisfied," and you shipped a search through a space whose extrema may not exist.
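A rough sketch of one way to give the loop a stopping rule: score every draft against a fixed checklist, cap the rounds, and stop as soon as a revision fails to reduce the number of failed checks. The names `bounded_refine`, `propose`, and `failed_checks` are hypothetical stand-ins for your worker and verifier, and "fewer failed checks" is only one possible progress measure.

```python
from typing import Callable, List, Tuple


def bounded_refine(
    propose: Callable[[str, List[str]], str],    # worker: (task, feedback) -> draft
    failed_checks: Callable[[str], List[str]],   # verifier: draft -> failed check names
    task: str,
    max_rounds: int = 4,
) -> Tuple[str, str]:
    """Return (draft, status); status is accepted, stalled, or budget_exhausted."""
    feedback: List[str] = []
    best_draft, best_failures = "", None
    for _ in range(max_rounds):
        draft = propose(task, feedback)
        failures = failed_checks(draft)
        if not failures:
            return draft, "accepted"
        # No strict improvement over the best draft so far: stop instead of spinning.
        if best_failures is not None and len(failures) >= len(best_failures):
            return best_draft, "stalled"
        best_draft, best_failures = draft, failures
        feedback = failures
    return best_draft, "budget_exhausted"
```

The point of the fixed checklist is that the target stops moving between calls. A "stalled" or "budget_exhausted" status is something a caller can alert on, which the silent loop never provides.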

The Token Budget That Ran Out Mid-Conversation: Why Free-Tier Users Think Your Model Got Dumber

· 12 min read
Tian Pan
Software Engineer

A product manager I know spent two weeks triaging a churn spike on her company's AI writing assistant. Free-tier session length had collapsed by 30%, the support inbox filled up with variations of "your model used to be smart, now it's lazy," and the team's first instinct was to blame a model upgrade that had shipped the same week. The model had not changed. What had changed was that finance had quietly tightened the per-user token budget mid-quarter, and the app had been silently truncating system prompts, dropping tool calls, and shortening responses for any user who crossed the new threshold. From the user's seat, the AI had degraded. From the dashboard, nothing was wrong. Both were true, and that is the failure mode.

This pattern is everywhere now. ChatGPT's free tier drops to a smaller model when the limit is hit, with no in-product label other than "responses may be shorter for a while." Anthropic's free tier behaves similarly. Build a feature on top of either, layer on your own per-user budget for cost control, and you have stacked two invisible cliffs in series — the platform's and yours — and the user, who only sees one chat box, has no way to tell which one they just walked off.

The Deflection Metric That Lied: When AI Support Success Hides User Churn

· 10 min read
Tian Pan
Software Engineer

A support leader I spoke with last quarter was glowing about a 78% deflection rate from the new AI agent. Tickets routed to humans had collapsed; cost per contact looked beautiful; the dashboard sparkled green for three straight months. Then revenue ops ran a cohort analysis. The customers who had hit the bot at least once during a billing question were churning at 1.7x the rate of customers who had not. The deflection metric had not measured help. It had measured silence — and silence turned out to be the sound of paying users walking out the door.

This is the failure mode that the industry is now naming aloud. Deflection counts conversations where the customer did not reach a human. It does not distinguish "I got my answer" from "I gave up." Treat those as the same number and you will optimize for the second one, because making the bot harder to escape is much easier than making it actually resolve issues. Klarna learned this publicly in 2025 when it began rehiring customer service staff a year after announcing AI had replaced roughly 700 agents; repeat contacts had jumped about 25%, and the savings line that justified the layoffs evaporated against the cost of re-handling everything the bot mishandled the first time.
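To make the gap concrete, here is a small sketch that computes plain deflection next to a stricter "resolved deflection" that excludes customers who came back. The `Conversation` fields and the seven-day recontact window are assumptions for illustration; repeat contact is only a proxy for "I gave up," not a measurement of it.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Conversation:
    escalated_to_human: bool
    recontacted_within_7d: bool  # hypothetical proxy for an unresolved issue


def deflection_rate(convs: Sequence[Conversation]) -> float:
    """Share of conversations that never reached a human."""
    return sum(not c.escalated_to_human for c in convs) / len(convs)


def resolved_deflection_rate(convs: Sequence[Conversation]) -> float:
    """Share that never reached a human and did not come back within the window."""
    return sum(
        not c.escalated_to_human and not c.recontacted_within_7d for c in convs
    ) / len(convs)
```

When the two numbers diverge, the difference is roughly the population the dashboard has been counting as a success.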

The Judge That Agreed With Itself Across A and B

· 10 min read
Tian Pan
Software Engineer

You run an A/B test on two prompt variants. Your LLM judge — same vendor as your candidate model, because it was the easy default — consistently prefers variant A by a margin large enough to call statistically meaningful. You ship A. A week later your satisfaction metric is down, your refund queue is up, and nobody can explain it. Somebody finally re-runs the eval with a judge from a different model family. The preference flips.

The judge was not measuring quality. The judge was measuring how much the candidate sounded like the judge.
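A cheap guard is to run the same comparison through a judge from a different model family and check whether the preference changes sign. The sketch below assumes each judge returns a list of "A" or "B" verdicts over the same items; the function names are illustrative, and it deliberately ignores statistical significance.

```python
from typing import Sequence


def win_rate(verdicts: Sequence[str], variant: str = "A") -> float:
    """Fraction of comparisons in which the judge preferred `variant`."""
    return sum(v == variant for v in verdicts) / len(verdicts)


def preference_flips(same_family: Sequence[str], other_family: Sequence[str]) -> bool:
    """True when the two judges land on opposite sides of 50% for variant A."""
    a = win_rate(same_family) - 0.5
    b = win_rate(other_family) - 0.5
    return a * b < 0
```

A flip does not tell you which judge is right. It tells you the result depends on who is judging, which is reason enough not to ship on it.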

Your AI Disclosure Disappeared by Turn Three and Nobody Noticed Until the Regulator Did

· 11 min read
Tian Pan
Software Engineer

Your legal team spent four meetings negotiating the exact disclosure sentence. Engineering put it at the top of the system prompt. QA confirmed it appears in turn one of every session. Three months later a regulator forwards a transcript: turn fourteen of a complaint-handling conversation, an hour of substantive guidance about a refund dispute, and nowhere in those fourteen turns does the user see the words "I am an AI." The disclosure your single-turn compliance review approved is structurally incapable of surviving the conversations that need it.

This is disclosure decay, and it is the multi-turn agentic failure mode that the wave of 2025–2026 chatbot regulation was not designed to catch and your QA process is not configured to test for. The EU AI Act's Article 50 obligations become enforceable on August 2, 2026, with fines for transparency violations of up to €15 million or 3% of global turnover. California's SB 243 took effect January 1, 2026, with a private right of action that lets consumers sue directly for at least $1,000 per violation. Washington requires recurring disclosures, with hourly cadences for minors. None of these regimes was written assuming the disclosure would silently drop out of a session after the third tool call. But that is what your runtime is doing right now, on every long-running conversation, in production.
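One way to stop relying on the system prompt is to enforce the cadence in the runtime, outside the model. The following is a sketch under stated assumptions: `DisclosurePolicy`, the reminder wording, and the five-turn and one-hour thresholds are placeholders, not a reading of what any of these statutes requires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisclosurePolicy:
    """Hypothetical runtime guard that re-attaches the disclosure on a cadence."""

    text: str = "Reminder: I am an AI."
    max_turns_between: int = 5           # assumed threshold
    max_seconds_between: float = 3600.0  # assumed threshold
    _last_turn: Optional[int] = None
    _last_time: Optional[float] = None

    def apply(self, reply: str, turn: int, now: float) -> str:
        # Due on the first reply, or when either the turn gap or the time gap is reached.
        due = (
            self._last_turn is None
            or turn - self._last_turn >= self.max_turns_between
            or now - self._last_time >= self.max_seconds_between
        )
        if not due:
            return reply
        self._last_turn, self._last_time = turn, now
        return f"{self.text}\n{reply}"
```

Because the check runs on the outgoing reply rather than inside the prompt, it survives context truncation and tool-call detours, and it is easy to assert on in a multi-turn test.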

Inference Billing as a P&L Line Item Nobody Owns

· 9 min read
Tian Pan
Software Engineer

Somewhere in your company, four people each believe a fifth person owns the inference bill. Engineering treats it as a cloud line item. The AI team treats it as the price of building. Finance treats it as a variable margin input that someone in engineering must already be managing. Product treats it as overhead that engineering absorbs. The bill keeps growing, and the only thing everyone agrees on is that it isn't theirs.

This is not a budgeting problem. It is an ownership vacuum, and it surfaces the first time the line item gets large enough for a CFO to ask about it on a board call. By then, the answers people improvise — "we'll optimize," "we'll cache more," "we'll switch models" — describe interventions without naming an owner. The conversation that should have happened a year earlier was not about how to lower the bill. It was about whose P&L the bill belonged to in the first place.

The shift is structural. Inference moved from 15% of enterprise AI spend in 2024 to roughly 85% in 2026, and the average enterprise AI budget grew from $1.2M to around $7M over the same window. A line item that was once rounding error is now the kind of number a board notices, and the org chart written before that shift has no row for it.

Multimodal Traces: When Modalities Must Share an ID

· 11 min read
Tian Pan
Software Engineer

A user called your support agent. They talked, the agent listened, the user uploaded a screenshot of the error mid-call, the agent reasoned over the image and the transcript, and the conversation ended with a follow-up email summarizing the fix. Three days later the user files a complaint: the fix did not work, and the email never arrived. You open your observability stack and you find three separate traces in three separate UIs. The voice pipeline shows you an ASR trace. The vision pipeline shows a span over the image upload. The LLM call shows a chat trace with a token count and a tool call. Nothing in any of these dashboards tells you they were the same conversation.

This is the postmortem nobody wants to write. Not because the data is missing — every individual modality logged what it was supposed to — but because the join across modalities was never built. Each pipeline grew its own tracing convention from whatever its model vendor shipped by default, and the conversational turn that bound them together exists only in the head of the engineer who designed the agent.
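The missing join can be as small as one identifier minted at the start of the conversation and stamped on every span, whatever pipeline emits it. This sketch uses a hypothetical `Span` record and in-memory grouping purely to show the shape; in practice the ID would travel through whatever tracing system each pipeline already uses.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Span:
    conversation_id: str  # shared across voice, vision, and LLM pipelines
    turn: int             # position of the conversational turn
    modality: str         # e.g. "voice", "vision", "llm"
    name: str


def new_conversation_id() -> str:
    """Mint one ID when the conversation starts and pass it to every pipeline."""
    return uuid.uuid4().hex


def group_by_conversation(spans: Iterable[Span]) -> Dict[str, List[Span]]:
    """Rebuild each conversation's timeline from spans emitted by separate pipelines."""
    grouped: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        grouped[span.conversation_id].append(span)
    for timeline in grouped.values():
        timeline.sort(key=lambda s: s.turn)
    return dict(grouped)
```

With the shared ID in place, the ASR trace, the image span, and the chat trace from the complaint above become three rows in one timeline instead of three dashboards.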

The Agent That Retried Its Way Past Your Rate Limit

· 10 min read
Tian Pan
Software Engineer

Your gateway enforces a clean 100 requests per second per tenant. The dashboard shows every tenant comfortably under that ceiling. The bill from your model provider says you blew through the spend cap anyway. Nobody on the rollout call has a clean story for why.

The answer is that the rate limiter and the bill are measuring different things. The limiter sees one "user request" when a customer clicks a button. The provider sees a planner call, three tool-result reflections, a format-correction retry triggered by a stricter JSON schema, and a final synthesis — each with its own internal retry budget that fires when a transient 429 or 500 comes back. A single click can fan out into thirty model calls. The limiter counts one. The bucket leaks at thirty times the rate it was sized for.

Rate-limiting an agentic system at the HTTP boundary is enforcing speed limits at the highway entrance while the cars inside multiply. Until the limiter understands the loop, the loop will route around it.
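A limiter that understands the loop has to count model calls, retries included, against a budget attached to the originating user request. Below is a minimal sketch of that idea; `RequestBudget`, `TransientError`, and `call_with_retries` are hypothetical names, and real retry logic would also add backoff.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable provider failure such as a 429 or 500."""


class CallBudgetExceeded(RuntimeError):
    """Raised when one user request tries to spend more model calls than allowed."""


class RequestBudget:
    """Counts every model call made on behalf of a single user request."""

    def __init__(self, max_model_calls: int):
        self.max_model_calls = max_model_calls
        self.used = 0

    def charge(self, n: int = 1) -> None:
        if self.used + n > self.max_model_calls:
            raise CallBudgetExceeded(
                f"budget of {self.max_model_calls} model calls exhausted"
            )
        self.used += n


def call_with_retries(call: Callable[[], T], budget: RequestBudget, max_attempts: int = 3) -> T:
    """Retry transient failures, charging the shared budget for every attempt."""
    last_error = None
    for _ in range(max_attempts):
        budget.charge()  # retries are not free: each attempt is counted
        try:
            return call()
        except TransientError as exc:
            last_error = exc
    raise last_error
```

Passing the same `RequestBudget` to the planner, the reflections, the format-correction retry, and the final synthesis means the thirty-call fan-out hits a ceiling the gateway never sees.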