<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://tianpan.co/blog</id>
    <title>TianPan.co</title>
    <updated>2026-06-02T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://tianpan.co/blog"/>
    <subtitle>Actionable essays, playbooks, and investor-grade memos on product, engineering leadership, and SaaS—so you ship faster and decide with conviction.</subtitle>
    <icon>https://tianpan.co/favicon.ico</icon>
    <rights>All rights reserved 2026, Tian Pan</rights>
    <entry>
        <title type="html"><![CDATA[Retrieval Pipeline Residency: The Embedding That Crossed the Border Your LLM Call Didn't]]></title>
        <id>https://tianpan.co/blog/2026-06-02-retrieval-pipeline-residency-the-embedding-that-crossed-the-border-your-llm-call-didnt</id>
        <link href="https://tianpan.co/blog/2026-06-02-retrieval-pipeline-residency-the-embedding-that-crossed-the-border-your-llm-call-didnt"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Your inference endpoint is pinned to Frankfurt. Your embedding API, vector control plane, rerank service, prompt cache, and trace store are not. A walkthrough of the six residency surfaces in a RAG request and the org gap where each one quietly crosses the border.]]></summary>
        <content type="html"><![CDATA[<p>The team that ships "AI for EU customers" usually ships exactly one residency control: an inference endpoint pinned to an EU region. The procurement team gets a DPA, the architecture diagram gets a green checkmark next to "model hosted in Frankfurt," and the launch proceeds. What the diagram doesn't show is that the customer's verbatim query gets vectorized by a US-hosted embedding API on its way to the model, that the vector store the query is matched against has its operational plane in <code>us-east-1</code>, that the rerank model is a third-party SaaS deployed wherever the vendor chose, that the prompt cache is keyed regionally on hits and globally on misses, and that the trace store logging the retrieved chunks has a 30-day retention bucket that replicates cross-region for redundancy.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=Retrieval%20Pipeline%20Residency%3A%20The%20Embedding%20That%20Crossed%20the%20Border%20Your%20LLM%20Call%20Didn%27t" alt="" class="img_ev3q"></p>
<p>The inference layer respects residency. The retrieval pipeline doesn't even know it's a participant.</p>
<p>This is the gap where most "GDPR-compliant" RAG deployments fail an audit the team didn't realize was coming. The fix isn't another control on the model call — it's recognizing that data residency is a property of every component the customer's bytes touch, and that the team owning "the LLM" owns at most one of the six surfaces involved.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-pipeline-has-six-residency-surfaces-not-one">The pipeline has six residency surfaces, not one<a href="https://tianpan.co/blog/2026-06-02-retrieval-pipeline-residency-the-embedding-that-crossed-the-border-your-llm-call-didnt#the-pipeline-has-six-residency-surfaces-not-one" class="hash-link" aria-label="Direct link to The pipeline has six residency surfaces, not one" title="Direct link to The pipeline has six residency surfaces, not one" translate="no">​</a></h2>
<p>When a user submits a query to a region-pinned RAG system, here is the actual list of network calls — each of which is a potential cross-border transfer:</p>
<ol>
<li class=""><strong>Query embedding.</strong> The user's verbatim text is sent to an embedding model, typically a third-party API (OpenAI, Cohere, Voyage). The request payload contains the customer's data in its original form.</li>
<li class=""><strong>Vector lookup.</strong> The embedding is queried against a vector database (Pinecone, Weaviate, Qdrant, pgvector). The data plane and the operational/control plane often live in different regions.</li>
<li class=""><strong>Keyword or hybrid search.</strong> A BM25 or full-text index runs in parallel with the vector search. This index has its own deployment region and its own backup policy.</li>
<li class=""><strong>Rerank.</strong> A cross-encoder reranks the top-k candidates against the query. Most teams use a managed API (Cohere Rerank, Voyage Rerank, Jina). The query and the candidate passages both transit to whichever region the rerank vendor deployed in.</li>
<li class=""><strong>LLM inference.</strong> The pinned call — the one residency was designed around.</li>
<li class=""><strong>Trace + observability.</strong> Every retrieved chunk, every prompt, every output gets logged to a trace store (LangSmith, Phoenix, Datadog LLM Observability, Helicone). The trace store has its own region, its own retention class, and often its own replication policy.</li>
</ol>
<p>A typical generative AI application now spans a model API, a vector store, an observability backend, an evaluation harness, a prompt-cache layer, and a feedback loop into fine-tuning or preference data. Every one of those surfaces is a potential cross-border transfer. Arguments that focus only on "where the model runs" miss five out of six.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="three-failure-modes-that-survive-every-we-hosted-it-in-frankfurt-review">Three failure modes that survive every "we hosted it in Frankfurt" review<a href="https://tianpan.co/blog/2026-06-02-retrieval-pipeline-residency-the-embedding-that-crossed-the-border-your-llm-call-didnt#three-failure-modes-that-survive-every-we-hosted-it-in-frankfurt-review" class="hash-link" aria-label="Direct link to Three failure modes that survive every &quot;we hosted it in Frankfurt&quot; review" title="Direct link to Three failure modes that survive every &quot;we hosted it in Frankfurt&quot; review" translate="no">​</a></h2>
<p><strong>The embedding API that wasn't on the diagram.</strong> The team picked an embedding model from a SaaS vendor because it benchmarked well on their corpus. The vendor's API has a single global endpoint. Every query the EU user types is sent verbatim — name, email, free-text complaint, whatever they typed — to a US data center for vectorization. The vectors come back, the vector lookup happens in-region, the LLM call happens in-region, and the team genuinely believes the system is residency-compliant because the bytes "didn't leave the region in the model call." The bytes left the region one hop earlier and nobody drew that arrow.</p>
<p><strong>The control plane that lived where the data plane didn't.</strong> Managed vector databases now uniformly offer regional data planes. Their control planes — the dashboards, the index management APIs, the operational telemetry, the per-query latency logs — frequently do not. A regulator who knows what they're looking for asks whether the query text appears in any system outside the region. The answer is almost always yes: the query latency log fires through a global observability backend with the prompt or chunk content attached as a tag for debugging convenience.</p>
<p><strong>The prompt cache that was regional on the hit path and global on the miss.</strong> Cache lookups are keyed on a hash of the prompt. The hash lookup is regional. On a cache miss, the system falls through to a "shared" path that may live elsewhere. Stanford researchers detected global cache sharing across users in seven API providers in 2025 — meaning the cache wasn't even isolated by tenant, let alone by region. If your latency-sensitive frontend depends on cache hits, the cache's residency posture is part of your residency posture, and the provider's defaults are not necessarily what you assumed.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-org-chart-is-where-the-gap-actually-lives">The org chart is where the gap actually lives<a href="https://tianpan.co/blog/2026-06-02-retrieval-pipeline-residency-the-embedding-that-crossed-the-border-your-llm-call-didnt#the-org-chart-is-where-the-gap-actually-lives" class="hash-link" aria-label="Direct link to The org chart is where the gap actually lives" title="Direct link to The org chart is where the gap actually lives" translate="no">​</a></h2>
<p>The technical gap maps onto an organizational gap, and the org chart is usually the easier one to debug first. In most enterprises shipping RAG:</p>
<ul>
<li class="">The <strong>AI team</strong> owns the model endpoint and the prompt template.</li>
<li class="">The <strong>platform team</strong> owns the vector database deployment.</li>
<li class="">The <strong>search/retrieval team</strong> owns the rerank service and the keyword index.</li>
<li class="">The <strong>observability team</strong> owns the trace store and the dashboards.</li>
<li class="">The <strong>security/compliance team</strong> owns the DPA and the residency contract with the customer.</li>
</ul>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="data-residency" term="data-residency"/>
        <category label="rag" term="rag"/>
        <category label="compliance" term="compliance"/>
        <category label="gdpr" term="gdpr"/>
        <category label="retrieval" term="retrieval"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The 40-Point Gap Between Your Interviewers When the Candidate Says 'I'd Just Prompt It']]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-40-point-gap-between-your-interviewers-on-id-just-prompt-it</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-40-point-gap-between-your-interviewers-on-id-just-prompt-it"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A forty-point disagreement on the same candidate is not a candidate problem — it's a rubric problem. How to calibrate an AI-engineer hiring loop your own team cannot yet agree on.]]></summary>
        <content type="html"><![CDATA[<p>The candidate hit the wall on the system-design question, paused for two seconds, and said: "I'd just prompt it." Your most senior interviewer wrote <em>strong hire — this is exactly how good engineers work in 2026</em>. Your second-most-senior interviewer wrote <em>no hire — handing the problem to a chatbot is not engineering</em>. Same five words. Same forty-minute window. A forty-point gap on the same scorecard.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%2040-Point%20Gap%20Between%20Your%20Interviewers%20When%20the%20Candidate%20Says%20%27I%27d%20Just%20Prompt%20It%27" alt="" class="img_ev3q"></p>
<p>The candidate didn't fail your loop. Your loop failed to have an opinion. And the worst part of the debrief is not the disagreement — it's the way each interviewer is so confident their read is the correct one that the meeting devolves into a referendum on AI itself rather than on whether this human can ship.</p>
<!-- -->
<p>This isn't a candidate-quality problem. It's a rubric-integrity problem dressed up as one, and the longer it goes uncalibrated the more your hiring bar becomes a function of which interviewers were on the panel that week instead of what the role actually requires.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-rubric-you-copied-from-2022-is-grading-a-job-that-no-longer-exists">The rubric you copied from 2022 is grading a job that no longer exists<a href="https://tianpan.co/blog/2026-06-02-the-40-point-gap-between-your-interviewers-on-id-just-prompt-it#the-rubric-you-copied-from-2022-is-grading-a-job-that-no-longer-exists" class="hash-link" aria-label="Direct link to The rubric you copied from 2022 is grading a job that no longer exists" title="Direct link to The rubric you copied from 2022 is grading a job that no longer exists" translate="no">​</a></h2>
<p>The interview loop you're running was almost certainly built around a definition of <em>engineering competence</em> that predates the working pattern your team actually uses every day. The loop checks whether the candidate can implement a small algorithm from scratch, reason about time complexity, and walk through a system-design problem on a whiteboard. None of those checks are wrong — they're just no longer load-bearing for what an AI engineer at your company will spend Tuesday morning doing.</p>
<p>The Tuesday-morning reality is closer to: read a 4,000-line module that nobody on the team wrote, decide what the LLM-generated first draft of the change got wrong, push back on the parts that are subtly broken, accept the parts that are subtly correct, and own the result whether the model wrote it or not. The 2022 rubric grades the "implement from scratch" muscle that the 2026 job rarely uses, and gives no points for the "review and edit AI output" muscle that the 2026 job uses constantly.</p>
<p>So when a candidate skips to "I'd just prompt it" without showing the underlying reasoning, your loop has no agreed-upon way to disambiguate two very different signals: the senior who is correctly identifying that this is a solved problem the model handles well, and the junior who is hiding the fact that they cannot reason about the problem at all. Both produce the same five-word answer. Only one of them is the candidate you want to hire.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-disagreement-is-the-data--not-the-noise">The disagreement is the data — not the noise<a href="https://tianpan.co/blog/2026-06-02-the-40-point-gap-between-your-interviewers-on-id-just-prompt-it#the-disagreement-is-the-data--not-the-noise" class="hash-link" aria-label="Direct link to The disagreement is the data — not the noise" title="Direct link to The disagreement is the data — not the noise" translate="no">​</a></h2>
<p>The standard reaction to a forty-point spread in a debrief is to argue harder, average the scores, or defer to the most senior voice. All three are wrong responses. The spread is the most valuable artifact your loop produced this week, and treating it as a vote-counting problem rather than a signal-extraction problem is how hiring bars quietly drift for years before anyone notices.</p>
<p>Inter-rater reliability is the boring statistical name for the thing that's broken. When structured-interview research reports inter-rater reliability climbing from around 0.37 to around 0.67 after calibration work, what it's really saying is: before calibration, your interviewers were agreeing only marginally above chance, and after calibration they were agreeing well enough that the panel's decision means something. The forty-point gap on "I'd just prompt it" is the unmistakable shape of an IRR below 0.4.</p>
<p>The fix isn't more interviewers. It isn't a more detailed rubric. It isn't a longer debrief. It's a calibration session where the panel sits down with the same recorded candidate answer and surfaces the <em>reasons</em> their scores diverged. Not "I felt strong hire" versus "I felt no hire" — but "I gave strong hire because pushing back on AI output requires the same judgment as writing it from scratch, and the candidate showed that judgment in their follow-up" versus "I gave no hire because the candidate didn't explain <em>what</em> they'd prompt or <em>how</em> they'd verify it, and I can't tell whether they have the judgment from this answer alone." Those are two different rubric items. Right now they're collapsed into one score.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-youre-actually-hiring-for-in-2026">What you're actually hiring for in 2026<a href="https://tianpan.co/blog/2026-06-02-the-40-point-gap-between-your-interviewers-on-id-just-prompt-it#what-youre-actually-hiring-for-in-2026" class="hash-link" aria-label="Direct link to What you're actually hiring for in 2026" title="Direct link to What you're actually hiring for in 2026" translate="no">​</a></h2>
<p>The "AI engineer" job title is doing too much work. The same posting is being used to recruit at least three different roles that share a stack but require materially different skills, and your interview loop is likely scoring all three against the same rubric.</p>
<p>The first role is the <em>product-AI engineer</em> who composes existing models into a working product surface. They spend their time on retrieval design, prompt iteration, eval construction, latency budgets, and the unglamorous integration work of making the model behave in a specific business context. They need taste, system thinking, and the discipline to write evals before they write features.</p>
<p>The second role is the <em>AI-platform engineer</em> who builds the inference, training, or RAG infrastructure that other engineers consume. They need depth in distributed systems, observability, and the unsexy plumbing of running GPU workloads reliably. The "I'd just prompt it" answer is a red flag for this role almost regardless of who said it, because their job is to build the layer that makes prompting work at all.</p>
<p>The third role is the <em>ML/research engineer</em> who is closer to the model itself — fine-tuning, evaluation methodology, or original training work. They still need to know the math, can usually implement a transformer block from scratch, and would treat "I'd just prompt it" as an admission that the candidate has no opinion about model behavior.</p>
<p>If your panel is interviewing all three roles against one rubric, "I'd just prompt it" is <em>strong hire</em> for role one, <em>neutral</em> for role two, and <em>no hire</em> for role three. The forty-point spread isn't disagreement about the candidate. It's three different interviewers scoring against the rubric they wish they were using.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-calibration-protocol-that-actually-surfaces-the-disagreement-before-the-offer">A calibration protocol that actually surfaces the disagreement before the offer<a href="https://tianpan.co/blog/2026-06-02-the-40-point-gap-between-your-interviewers-on-id-just-prompt-it#a-calibration-protocol-that-actually-surfaces-the-disagreement-before-the-offer" class="hash-link" aria-label="Direct link to A calibration protocol that actually surfaces the disagreement before the offer" title="Direct link to A calibration protocol that actually surfaces the disagreement before the offer" translate="no">​</a></h2>
<p>The pattern that works is uncomfortable because it requires the panel to admit they don't already agree. Most interview loops skip calibration because the senior people on it have been doing this for years and assume their judgment is the standard. The 0.37 IRR number says otherwise.</p>
<p>A workable protocol has four pieces. First, before the loop runs, the panel agrees on the <em>job-shaped rubric</em> — which of the three roles above this requisition is hiring for, and which competencies are load-bearing versus nice-to-have. The output is one page, not a deck, and the disagreements that surface during this conversation are exactly the disagreements the debrief was previously surfacing too late.</p>
<p>Second, the panel reviews two or three recorded or transcribed candidate sessions together and scores them independently before discussing. The point is not to agree — it's to discover where the disagreements are. A pattern of disagreement that recurs across multiple sessions (one interviewer always rates the "I'd just prompt it" answer higher than another) is a rubric ambiguity, not a personality clash.</p>
<p>Third, the rubric grows <em>anchor examples</em> for the ambiguous competencies. "Demonstrates judgment about AI output" is too abstract to score consistently. "Identifies a specific failure mode in the model's first draft and proposes a verification step before merging" is concrete enough that two interviewers will land on similar scores. The anchors are written from the disagreements surfaced in step two.</p>
<p>Fourth, the panel runs a periodic re-calibration on the candidates they've already hired and the ones they've already rejected. The question is not "did we get this right" — it's "would this scorecard land on the same decision today, and if not, has the rubric drifted or has the bar drifted?" When the senior interviewer who pushed for the "strong hire on I'd just prompt it" candidate is now frustrated with that hire's inability to reason about edge cases, the calibration session is where that signal becomes actionable rather than personal.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-leadership-move-when-your-own-team-cant-agree">The leadership move when your own team can't agree<a href="https://tianpan.co/blog/2026-06-02-the-40-point-gap-between-your-interviewers-on-id-just-prompt-it#the-leadership-move-when-your-own-team-cant-agree" class="hash-link" aria-label="Direct link to The leadership move when your own team can't agree" title="Direct link to The leadership move when your own team can't agree" translate="no">​</a></h2>
<p>The hardest part of fixing this is not the protocol. It's admitting, in front of the engineers you respect, that the function whose hiring rubric you cannot defend is the function you are responsible for staffing. The instinct is to keep running the loop and hope the disagreements average out across enough panels. They don't. They drift. The bar your loop is actually enforcing is the union of whichever interviewers happened to be available that week, and the function you build over twelve months is shaped by that union rather than by any explicit decision.</p>
<p>The leadership move is to pause the loop long enough to run the calibration session, even though the requisition is open and the recruiter is impatient and the engineering manager wants headcount before the next OKR cycle. A panel that cannot agree on what "I'd just prompt it" means is a panel that will keep making hiring decisions on which neither the strong-hire interviewer nor the no-hire interviewer can later defend. The cost of staffing a function whose hiring rubric your own team cannot agree on is not a one-time miss — it's a year of mis-leveled offers, a year of debriefs that turn into AI-philosophy debates, and a year of post-hoc rationalization about whether the people you brought in are doing the job you needed.</p>
<p>The candidate's five-word answer isn't the ambiguity. Your rubric is. Fix that before you run the next loop, and the debriefs start being about candidates again.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="hiring" term="hiring"/>
        <category label="ai-engineering" term="ai-engineering"/>
        <category label="interview-loop" term="interview-loop"/>
        <category label="calibration" term="calibration"/>
        <category label="engineering-leadership" term="engineering-leadership"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The A/B Test Powered by Token Counts Instead of Outcomes]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-a-b-test-powered-by-token-counts-instead-of-outcomes</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-a-b-test-powered-by-token-counts-instead-of-outcomes"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[When the experiment platform makes token counts easy and user outcomes hard, prompt A/B tests ship local maxima the team cannot distinguish from regressions.]]></summary>
        <content type="html"><![CDATA[<p>A team I worked with shipped a prompt change that reduced output tokens by 22%. The experiment dashboard lit up green — variance was tight, the p-value was clean, and the cost savings extrapolated to six figures a year. Two weeks later, a product analyst poking at conversion funnels flagged that the downstream task completion rate had dropped 11% in the same window. The shorter outputs were leaving out a clarifying step that users had been quietly relying on to know what to click next.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20A%2FB%20Test%20Powered%20by%20Token%20Counts%20Instead%20of%20Outcomes" alt="" class="img_ev3q"></p>
<p>The experiment platform had not lied. It had reported the exact metric the team configured as primary, and that metric had moved in the right direction. The problem was that the metric measured something the team did not actually care about. Tokens were cheap to count, the experiment infra had a turnkey integration for them, and outcomes were hard to instrument — so the team picked what the platform made easy. The result was a clean win on the dashboard and a regression in the product.</p>
<p>This pattern shows up across AI-powered features the same way that vanity-metric A/B testing showed up across the rest of the web a decade earlier. The mechanism is identical: when the easy-to-measure proxy and the hard-to-measure outcome are not perfectly correlated, the experimentation infrastructure will optimize for whichever metric it makes easy to ship on. The team that did not force itself to measure the outcome is shipping local maxima it cannot distinguish from regressions.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-proxy-substitution-trap">The Proxy Substitution Trap<a href="https://tianpan.co/blog/2026-06-02-the-a-b-test-powered-by-token-counts-instead-of-outcomes#the-proxy-substitution-trap" class="hash-link" aria-label="Direct link to The Proxy Substitution Trap" title="Direct link to The Proxy Substitution Trap" translate="no">​</a></h2>
<p>The substitution happens almost without anyone noticing. The team starts with a question phrased at the outcome layer — "does this prompt change make the assistant more useful?" — and ends with an experiment configured at the proxy layer — "does this prompt change reduce average output tokens?" The translation looks reasonable on the way down because tokens correlate with cost, cost correlates with margin, and margin correlates with "useful" in a sufficiently abstract sense. But each step throws away signal, and by the time the dashboard is built, the question being answered bears only a passing resemblance to the question that was asked.</p>
<p>Goodhart's Law describes the same dynamic in the abstract: when a measure becomes a target, it ceases to be a good measure. The literature on it gets dense, but the engineering version is simple. The moment your experiment promotion criteria depend on a proxy, your team will start producing changes that move the proxy without moving the underlying thing the proxy was supposed to track. This is not malice. It is what optimization does. Reinforcement learning research has documented this so thoroughly that the field has its own name for it — reward hacking — and the same effect operates in human teams running prompt experiments.</p>
<p>The reason it is hard to resist is that proxy metrics are genuinely useful as secondary signals. Token counts tell you something real about cost. Latency tells you something real about user experience. The mistake is not measuring them; it is letting them carry the weight of a primary metric they were never designed to bear. A team that measures token reduction as a guardrail to make sure cost does not blow up is doing it right. A team that ships changes because token reduction moved is doing it wrong, and the dashboard cannot tell the difference.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-llm-features-are-especially-vulnerable">Why LLM Features Are Especially Vulnerable<a href="https://tianpan.co/blog/2026-06-02-the-a-b-test-powered-by-token-counts-instead-of-outcomes#why-llm-features-are-especially-vulnerable" class="hash-link" aria-label="Direct link to Why LLM Features Are Especially Vulnerable" title="Direct link to Why LLM Features Are Especially Vulnerable" translate="no">​</a></h2>
<p>Three properties of LLM-powered features make this failure mode worse than in classical product experimentation.</p>
<p>The first is that the output is high-dimensional. A traditional A/B test measures whether a button got clicked, and the click is the outcome. An LLM A/B test measures whether a response was generated, and the response is a thousand tokens with internal structure that maps to user behavior in ways the experiment framework does not see. The natural primary metric — was this response good — is not directly observable. So teams reach for whatever is observable, and what is observable is the response's surface properties: token count, latency, refusal rate, format compliance. None of these is the outcome.</p>
<p>The second is that cost pressure is constant and visible. Token spend is a line item the CFO asks about. Task completion rate is a metric the product team has to build infrastructure to measure. The asymmetry of organizational attention means that token-related metrics get instrumented first, dashboards get built first, OKRs get tied to them first, and by the time anyone asks "but did users get what they wanted," the experiment culture has already calcified around what was easy.</p>
<p>The third is that the path from output quality to user outcome is mediated by a longer chain than in classical features. A shorter output might cause a user to ask a follow-up question instead of completing the task, which raises the conversation length, which raises the per-session cost, which cancels the savings from the original prompt change — but none of that shows up in a single-turn A/B test scoped to first-response token count. The cost win is real at the level the test measured, and the cost loss is real at the level the user experiences, and the experiment platform reports only the first one.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-forced-taxonomy-of-metrics">A Forced Taxonomy of Metrics<a href="https://tianpan.co/blog/2026-06-02-the-a-b-test-powered-by-token-counts-instead-of-outcomes#a-forced-taxonomy-of-metrics" class="hash-link" aria-label="Direct link to A Forced Taxonomy of Metrics" title="Direct link to A Forced Taxonomy of Metrics" translate="no">​</a></h2>
<p>The first concrete fix is to make the proxy/outcome distinction structural rather than cultural. Cultural rules — "remember to also look at the user impact" — do not survive contact with quarterly reviews. Structural rules — "the experiment platform will not let you mark this experiment shippable without a classified primary metric" — do.</p>
<p>The taxonomy that has worked well is three-tier:</p>
<ul>
<li class=""><strong>Outcome metrics</strong>: measurements of whether the user got what they were trying to get. Task completion rate, accepted code edit rate, ticket resolved without escalation, search result clicked through to a satisfied dwell time. These are expensive to instrument and slow to measure, and they are the only metrics on which an experiment can be declared a win.</li>
<li class=""><strong>Proxy metrics</strong>: cheap, fast, technical measurements that correlate with outcomes but are not outcomes. Token counts, latency, response length, format compliance, refusal rate. These are useful for debugging and for understanding the mechanism of a win, but a movement here alone is not sufficient evidence to ship.</li>
<li class=""><strong>Guardrail metrics</strong>: measurements of harm that the change must not cause. Cost per resolved task, p95 latency, hallucination rate on a held-out eval, user-reported negative feedback. These have looser detection thresholds because they exist to catch large regressions, not to drive ship decisions.</li>
</ul>
<p>The taxonomy only does work if the experiment platform enforces it — if the "promote" button is greyed out when only proxy metrics have moved, and only the user-outcome layer can light it up green. Teams that classify metrics in a slide deck and then fail to enforce the classification at the tool layer end up with the same pre-taxonomy problem in different clothes. The point of the taxonomy is to remove the option of accidentally shipping on a proxy, not to remind people that they could.</p>
<p>This also forces a useful conversation about which metrics are outcomes for a given feature. A code completion product's outcome is the accepted-and-retained edit, not the edit suggestion shown. A search product's outcome is the satisfied query, not the result click. A support agent's outcome is the resolved ticket without escalation, not the response that closes the conversation. Different products have different outcomes, and the taxonomy work is mostly the work of naming them precisely.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-slow-cohort-that-anchors-the-decision">The Slow Cohort That Anchors the Decision<a href="https://tianpan.co/blog/2026-06-02-the-a-b-test-powered-by-token-counts-instead-of-outcomes#the-slow-cohort-that-anchors-the-decision" class="hash-link" aria-label="Direct link to The Slow Cohort That Anchors the Decision" title="Direct link to The Slow Cohort That Anchors the Decision" translate="no">​</a></h2>
<p>A/B tests on AI features tend to run on short windows because the AI feature itself iterates fast. Two weeks is a generous experiment duration in a team shipping prompt changes weekly. The problem is that the outcome metrics you actually care about often have longer feedback loops than the experiment supports.</p>
<p>A user who got a slightly worse response on Tuesday might not have noticed it; they might have asked a follow-up that the dashboard counted as a separate session; they might have churned three weeks later for reasons the experiment will never connect back to that prompt. The two-week window is too short to detect the cohort effect, and the AB platform is set up to declare a verdict at the end of the window regardless.</p>
<p>The pattern that closes this gap is a longitudinal outcome cohort that runs alongside the A/B but on a slower clock. The A/B determines what gets tentatively shipped; the cohort determines what stays shipped. Concretely:</p>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-engineering" term="ai-engineering"/>
        <category label="experimentation" term="experimentation"/>
        <category label="llm-ops" term="llm-ops"/>
        <category label="metrics" term="metrics"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-agent-budget-that-approved-cost-per-call-and-never-measured-cost-per-resolved-task</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-agent-budget-that-approved-cost-per-call-and-never-measured-cost-per-resolved-task"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[An agent that drives cost-per-call down 25% while cost-per-resolved-task drifts up 40% is the most common unit-economics failure in agentic deployments. Here is why the vendor SKU is not the unit of work, and how to put the right metric on the wall.]]></summary>
        <content type="html"><![CDATA[<p>A quarter into the rollout, the AI team reported a 25% reduction in average cost-per-API-call. The support team reported that average handle time on AI-routed tickets had drifted from four turns to seven. Both numbers were correct. Both teams were measuring the system they had been told to optimize. The finance team, sitting between them, could not reconcile the dashboards because neither one was denominated in the thing the customer was actually paying for: a resolved ticket. The cost-per-call had gone down. The cost-per-resolved-task had gone up 40%. Nobody owned that number, so nobody was watching it move.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Agent%20Budget%20That%20Approved%20Cost-Per-Call%20and%20Never%20Measured%20Cost-Per-Resolved-Task" alt="" class="img_ev3q"></p>
<p>This is the most common unit-economics failure I see in agentic deployments, and it is not a measurement bug. It is a definitional one. The vendor's pricing page exposes cost-per-call because that is the unit they bill. The spreadsheet line item inherits that unit because it fits in a cell. The engineering team optimizes against the unit they were given. By the time the gap between API economics and business economics becomes visible, it has been compounding for a quarter, and the agent has been quietly trained on the wrong loss function the entire time.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-vendor-sku-is-not-the-unit-of-work">The vendor SKU is not the unit of work<a href="https://tianpan.co/blog/2026-06-02-the-agent-budget-that-approved-cost-per-call-and-never-measured-cost-per-resolved-task#the-vendor-sku-is-not-the-unit-of-work" class="hash-link" aria-label="Direct link to The vendor SKU is not the unit of work" title="Direct link to The vendor SKU is not the unit of work" translate="no">​</a></h2>
<p>The unit of work the customer pays for is rarely the unit the model provider bills for. A customer pays for a resolved ticket, an accepted suggestion, a completed booking, a generated brief that ships without rewrite. The provider bills for tokens, or seats, or model calls. These are not the same unit, and the conversion ratio between them is the thing that determines whether the agent has positive unit economics.</p>
<p>The naive calculation that gets put into the business case usually looks like this: average tokens per ticket times price per token equals cost per ticket. In practice that number is wrong by a factor of three to eight. A realistic support resolution that requires two or three tool calls fires five to eight LLM inferences, each of which carries the accumulated conversation context. By turn seven the input token count on each call has tripled from turn one. A session that runs twice as many turns can easily cost three or four times as much, because later turns are more expensive per turn than earlier turns. None of that shows up in the cost-per-call dashboard, because cost-per-call is an average and the distribution is long-tailed.</p>
<p>The pattern is that the vendor's billing schema is a shape, and the team's optimization target inherits that shape. If the shape is per-token, the team optimizes for shorter outputs. If the shape is per-call, the team optimizes for fewer calls. If the shape is per-seat, the team optimizes for higher seat utilization. None of those targets is necessarily aligned with the unit the customer is paying for, and in many deployments at least one of them is actively misaligned. The shorter output might be the one that leaves the user with a follow-up question. The fewer calls might be the ones that skipped a tool the resolution needed. The higher seat utilization might be the agent handling more tickets per seat at lower per-ticket quality.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-goes-in-the-numerator-that-nobody-puts-in">What goes in the numerator that nobody puts in<a href="https://tianpan.co/blog/2026-06-02-the-agent-budget-that-approved-cost-per-call-and-never-measured-cost-per-resolved-task#what-goes-in-the-numerator-that-nobody-puts-in" class="hash-link" aria-label="Direct link to What goes in the numerator that nobody puts in" title="Direct link to What goes in the numerator that nobody puts in" translate="no">​</a></h2>
<p>When teams do try to compute cost-per-resolved-task, the second mistake is usually the numerator. The instinct is to count only the tokens that produced the resolution: the successful path, the accepted output, the call that closed the ticket. Everything else — the abandoned conversations, the failed tool calls that triggered retries, the timeouts, the runs that escalated to a human after consuming half a model's worth of context — gets bucketed as overhead or quietly ignored.</p>
<p>The correct numerator is total fully loaded spend on that workflow over the period, including every failed attempt, every retry, every abandoned session, every escalation, and every shadow-mode evaluation the workflow triggered. The denominator is accepted outcomes only. A run that consumed forty thousand tokens and ended in an escalation contributes to the numerator and not to the denominator. So does a run that the user abandoned at turn nine. So does a run that an internal eval flagged as low-quality and re-ran. The result is a number that looks, at first, alarmingly high, and that is the point. The first time a team computes cost-per-accepted-outcome honestly, the number is typically three to eight times what the API math suggested. That gap is the cost of every path that did not end in the business getting what it paid for.</p>
<p>A useful refinement is to break the numerator out by failure mode. Tag each run with one of a small set of outcome states — accepted, rejected, abandoned, timed out, tool error, escalated — and attribute its cost to a bucket. Now you can report a Failure Cost Share alongside cost-per-outcome: the percentage of total workflow spend that produced no business-acceptable result. When Failure Cost Share moves, it tells you which class of failure is driving the unit economics this quarter, and the optimization conversation shifts from "make tokens cheaper" to "make this specific failure mode rarer."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-optimization-loop-you-accidentally-trained">The optimization loop you accidentally trained<a href="https://tianpan.co/blog/2026-06-02-the-agent-budget-that-approved-cost-per-call-and-never-measured-cost-per-resolved-task#the-optimization-loop-you-accidentally-trained" class="hash-link" aria-label="Direct link to The optimization loop you accidentally trained" title="Direct link to The optimization loop you accidentally trained" translate="no">​</a></h2>
<p>Once cost-per-call is the metric on the wall, the engineering team's optimization loop adapts to it. Within a quarter, the optimizations that get shipped are the ones that move that number. Shorter prompts, more aggressive caching, smaller models on the orchestrator, fewer tool calls per turn. Each of these is a real engineering choice with real trade-offs, but the trade-offs are not visible in the cost-per-call metric — they show up downstream, in resolution rate, in turn count, in escalation rate, in CSAT. Those are different dashboards, owned by different teams, and they move on different cadences.</p>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-agents" term="ai-agents"/>
        <category label="finops" term="finops"/>
        <category label="unit-economics" term="unit-economics"/>
        <category label="observability" term="observability"/>
        <category label="metrics" term="metrics"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Agent Plan That Branched on a Fact Your Context Pruner Already Dropped]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[When a context pruner evicts a tool result that a later plan step silently depends on, the agent keeps branching against evidence that no longer exists — and the trace looks like a hallucination.]]></summary>
        <content type="html"><![CDATA[<p>A long-running agent generates a plan at step 3. The plan reads something like: "if the order returned by <code>get_order</code> in step 1 has status <code>shipped</code>, send the customer a tracking email; otherwise open a refund ticket." The agent confidently picks the email branch. The customer never received a tracking number, because the order was actually in <code>pending</code>. You go to the trace expecting to find a hallucination. What you find is worse: the step-1 tool result is no longer in context. The pruner evicted it between step 2 and step 3 — it ranked low on recency and there was a 12KB transcript to make room for. The plan still ran. The branch was still chosen. The decision now points at evidence that does not exist.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Agent%20Plan%20That%20Branched%20on%20a%20Fact%20Your%20Context%20Pruner%20Already%20Dropped" alt="" class="img_ev3q"></p>
<p>This is not a model failure in the usual sense. The model produced a syntactically valid plan, executed it in order, and made a branch decision. The branch was made against a fact that used to be in context and is not anymore. The chain of thought encoded the condition (<code>if status == "shipped"</code>); the actual status got dropped on the way to the step that needed it. The plan looks deterministic, but it has been quietly cut loose from its evidence.</p>
<p>What makes this class of bug particularly bad is that it does not look like a bug. There is no exception. There is no schema violation. There is no hallucination in the obvious sense — the model is not inventing a status; it is recalling a status it saw earlier and acting on it. From the outside, the agent looks confident, fast, and wrong. Inside the trace, the evidence that justified the decision has been garbage-collected.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-pruner-doesnt-know-about-the-plan">The Pruner Doesn't Know About the Plan<a href="https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped#the-pruner-doesnt-know-about-the-plan" class="hash-link" aria-label="Direct link to The Pruner Doesn't Know About the Plan" title="Direct link to The Pruner Doesn't Know About the Plan" translate="no">​</a></h2>
<p>Context pruning, in most production setups today, is a stateless filter. It runs between agent turns. It looks at the message list, applies some policy — drop tool results older than N turns, keep messages under a token threshold, prefer recency, hold onto the system prompt — and returns a smaller message list. Anthropic's <code>clear_tool_uses_20250919</code> strategy, for example, drops old tool results after a configurable token threshold and keeps only the N most recent. Other harnesses summarize old turns into a compact note, or selectively retain messages by an embedding-based relevance score.</p>
<p>All of these strategies share a property: they are oblivious to the plan. They do not know that step 3 of the plan reads "branch on the result of step 1." They cannot, because the plan is not a first-class object the pruner can inspect. The plan lives inside the model's reasoning tokens. The pruner lives outside, in the harness. There is no edge between "this plan step depends on this tool result" and "this tool result is a candidate for eviction."</p>
<p>The consequence is that the pruner is applying a stateless decision rule to a stateful planning trace. Anything the plan needs that the pruner does not happen to retain becomes invisible to the next step. The model, with admirable composure, fills in the missing fact from its in-context memory of having seen it earlier — except that memory is now a residue inside the reasoning trace rather than a fact it can verify. The branch fires anyway.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-symptom-looks-like-hallucination">The Symptom Looks Like Hallucination<a href="https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped#the-symptom-looks-like-hallucination" class="hash-link" aria-label="Direct link to The Symptom Looks Like Hallucination" title="Direct link to The Symptom Looks Like Hallucination" translate="no">​</a></h2>
<p>When this fails in production, the most natural place to start debugging is "the model hallucinated." That diagnosis is wrong, but understandably so. The symptom matches: the model asserted a fact (the order is shipped) that turned out to be false (it was pending), and acted on that assertion. Every external tool worked. The system prompt is unchanged. The model version is unchanged. By process of elimination, it must be the model.</p>
<p>The actual sequence is closer to a use-after-free. The fact was true when the model first observed it (or true enough — perhaps the order really was shipped at step 1 and changed between then and the branch, which is its own problem). The fact got referenced in the plan. The fact got pruned. The reference is dangling. The model, having no signal that the underlying memory is gone, dereferences it.</p>
<p>Production agent debugging tools have started to make this debuggable, but only just. Step-level tracing now captures every tool call, reasoning chain, and pruning event into a single timeline. If you have that, you can replay the pruning decision and see which tool results were dropped between step N and step N+1. What you typically still cannot see is which of those dropped results were <em>needed</em> by a future step, because nobody encoded that dependency.</p>
<p>The Chroma research on context rot — testing 18 frontier models across input lengths — shows that information accuracy follows a U-shaped curve: high at the start and end of context, 30%+ lower in the middle. Pruners that drop the middle are doing the right thing for that curve. But the middle is also where most plan-relevant tool results sit by the time the agent is six or eight steps in. The pruner's policy and the plan's needs disagree about which bytes are valuable, and the pruner wins because it runs first.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-just-keep-more-context-doesnt-fix-it">Why "Just Keep More Context" Doesn't Fix It<a href="https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped#why-just-keep-more-context-doesnt-fix-it" class="hash-link" aria-label="Direct link to Why &quot;Just Keep More Context&quot; Doesn't Fix It" title="Direct link to Why &quot;Just Keep More Context&quot; Doesn't Fix It" translate="no">​</a></h2>
<p>The first reflex is to raise the keep-count or push out the eviction threshold. Keep 20 tool results instead of 5. Use the million-token window your provider just shipped. Stop pruning. This works in the demo and falls over in production for two reasons.</p>
<p>First, context rot is real and cumulative. Chroma's data is unambiguous: every frontier model gets worse as input length increases, and the degradation starts well before the advertised context limit. A 1M-token window still rots at 50K tokens. Holding more context to avoid the eviction problem creates a different problem — the model becomes less reliable at finding the fact it needs even when the fact is right there.</p>
<p>Second, the cost math gets ugly fast. Long-running agents — coding agents are the obvious worst case — accumulate context exponentially as they search, read files, run tools, and backtrack. Coding agents at 35-minute task durations see success rates drop sharply, and doubling task duration quadruples the failure rate. The teams who solved this did not solve it by buying more tokens. They solved it by being more deliberate about which tokens stayed.</p>
<p>So pruning is not optional, and the pruner cannot just be told "keep everything." It has to know which things to keep. And right now, in most architectures, it has no way to know.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="plan-steps-need-dependency-edges-to-their-evidence">Plan Steps Need Dependency Edges to Their Evidence<a href="https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped#plan-steps-need-dependency-edges-to-their-evidence" class="hash-link" aria-label="Direct link to Plan Steps Need Dependency Edges to Their Evidence" title="Direct link to Plan Steps Need Dependency Edges to Their Evidence" translate="no">​</a></h2>
<p>The fix is structural. Treat the plan as a graph, not as a string. When the model generates a plan, capture which prior tool results, parameters, or facts each step depends on, and store those dependencies as first-class edges next to the plan. The pruner gets a new input: not just the message list, but a dependency manifest that says "tool result <code>tc_42a</code> is referenced by plan step 5, which has not yet executed; do not evict."</p>
<p>This is the same discipline that database query planners use to keep intermediate results alive while a query is executing — you do not garbage-collect a hash table that a later operator is going to probe against. It is the same discipline that distributed schedulers like Kubernetes use to pin pods to nodes they have declared affinity for. Agents have, so far, been written without the analog.</p>
<p>What the dependency manifest looks like in practice:</p>
<ul>
<li class="">A unique ID on every tool call result. Many SDKs already give you one.</li>
<li class="">A planning step that, alongside the natural-language description, emits an explicit list of input IDs it will consume.</li>
<li class="">A pruner that consults the manifest before evicting anything, treats any ID referenced by an unexecuted plan step as pinned, and only evicts results whose downstream dependencies have all completed.</li>
</ul>
<p>This sounds heavyweight. In practice it is a small JSON object next to the plan, plus a five-line check in the pruner. The behavioral difference is large: plans no longer execute against vanished evidence, because the things plans depend on cannot vanish until the plan is done with them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="validators-at-the-step-boundary">Validators at the Step Boundary<a href="https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped#validators-at-the-step-boundary" class="hash-link" aria-label="Direct link to Validators at the Step Boundary" title="Direct link to Validators at the Step Boundary" translate="no">​</a></h2>
<p>Pinning is necessary but not sufficient. Even with pinning, two things can still happen: a plan can be updated mid-execution to reference a result that has already been evicted, or a step can implicitly depend on context that was never registered as an input. Both of these will keep happening in real systems, because both the plan and the dependency manifest are LLM-generated and will sometimes be wrong.</p>
<p>The second piece is a per-step context validator. Before a plan step runs, the validator checks that every dependency the step claims to need is still present in context. If it is not, the step does not silently proceed. It either pauses to re-fetch the missing artifact, raises a structured error the orchestrator can catch, or triggers a re-planning pass with the surviving context as input.</p>
<p>This is the same pattern as a database transaction reading at a snapshot and noticing the snapshot is no longer available. The transaction does not pretend the data was the same as the last snapshot it remembers. It aborts and starts over, or escalates. The agent analog: a step that requires <code>tc_42a</code> and cannot find it must not improvise. It must either resurface the dependency or admit it cannot continue.</p>
<p>What you give up is a small amount of latency on edge cases — re-fetching a tool result that was pruned, or replanning on the surviving context. What you gain is that the agent stops executing plans whose justification is no longer in the room. Most teams I have seen instrument this discover that the rate of dropped-dependency events is low single digits per session — but the rate of <em>bad outcomes</em> attributable to those events is dramatically higher, because they bypass every other safeguard.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="coupling-the-pruner-and-the-planner">Coupling the Pruner and the Planner<a href="https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped#coupling-the-pruner-and-the-planner" class="hash-link" aria-label="Direct link to Coupling the Pruner and the Planner" title="Direct link to Coupling the Pruner and the Planner" translate="no">​</a></h2>
<p>The architectural realization underneath all of this is that context pruning and plan execution are not independent subsystems. They share a contract: the plan is allowed to reference any context that has not been evicted, and the pruner is not allowed to evict anything the plan is still going to reference. Most production agent frameworks ship these subsystems separately, owned by different parts of the harness, and the integration is "they both touch the message list."</p>
<p>When the integration is that loose, you get exactly the failure described at the top: a plan whose execution diverges from its reasoning, because the pruner did its job correctly under its own policy and the planner did its job correctly under its own assumptions, and the assumptions did not match. Nobody owns the edge case where they disagree.</p>
<p>The teams shipping reliable long-running agents are starting to treat this seam as load-bearing. Some build a single context manager that handles both planning state and pruning policy, so the two cannot drift. Some emit explicit plan-context dependency events into their trace format, so observability can flag dropped-dependency cases automatically. Some go further and treat the plan itself as part of the prunable context — with rules — so that if the plan references vanished evidence, the plan itself gets re-derived rather than re-executed against ghosts.</p>
<p>The common thread: stop treating the pruner as a generic context-shrinker and start treating it as a participant in plan execution. The pruner has to know what the plan wants. The plan has to declare what it needs. When the two agree, the agent stops branching on facts that are no longer there.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-to-audit-on-monday">What to Audit on Monday<a href="https://tianpan.co/blog/2026-06-02-the-agent-plan-that-branched-on-a-fact-your-context-pruner-already-dropped#what-to-audit-on-monday" class="hash-link" aria-label="Direct link to What to Audit on Monday" title="Direct link to What to Audit on Monday" translate="no">​</a></h2>
<p>If your agent makes multi-step plans and prunes context between steps — almost every long-running agent does both — there is one audit that surfaces this problem quickly. Pick a sample of recent failed sessions where the failure was "the agent made a wrong decision but no error was raised." For each one, reconstruct the message list at the moment the wrong decision was made. Ask: was the evidence the plan implicitly cited still in the context window at that step?</p>
<p>If the answer is "no" more than a handful of times across your sample, you have a coupling problem between the planner and the pruner, and adding more guardrails to either one in isolation will not fix it. The next agent failure that looks like a hallucination is, more often than people want to admit, a context dependency that nobody declared and the pruner therefore ate. Decisions made against evidence that no longer exists will keep happening until the pruner stops being allowed to make that decision alone.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="ai-agents" term="ai-agents"/>
        <category label="context-engineering" term="context-engineering"/>
        <category label="observability" term="observability"/>
        <category label="reliability" term="reliability"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Agent Rollout Cadence Your Customer Success Team Could Not Absorb]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-agent-rollout-cadence-your-customer-success-team-could-not-absorb</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-agent-rollout-cadence-your-customer-success-team-could-not-absorb"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[When the AI team ships behavior changes weekly behind feature flags but customer success trains monthly, the gap shows up as customer trust quietly collapsing. The fix is a coordination contract, not more meetings.]]></summary>
        <content type="html"><![CDATA[<p>The customer pasted the agent's answer into a support chat and asked the human rep to confirm it. The rep, looking at the same product, said the opposite. The customer did not lose trust in the agent that day. They lost trust in the company, because two parts of it told them two different things in the same hour.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Agent%20Rollout%20Cadence%20Your%20Customer%20Success%20Team%20Could%20Not%20Absorb" alt="" class="img_ev3q"></p>
<p>Nothing was broken. The AI team had shipped a prompt change on Tuesday behind a feature flag, ramped it to 100% by Thursday, and moved on. The customer success team's enablement cycle is monthly — that is how every other product feature has always landed, and nobody re-negotiated the contract for AI. The macro in the CS rep's queue and the FAQ doc on the public site still described the previous behavior. The agent was correct. The rep was correct against the documentation they had. The company was incoherent.</p>
<p>This is the failure mode that does not show up in the eval scores or the engagement deltas the AI team watches. It shows up in CSAT, in ticket volume, in churn cohorts a quarter later, and in the conference-room conversation where the head of CX asks the head of AI to please stop shipping for a few weeks so their team can catch up. The answer "we can't, this is how we move" is technically true and operationally untenable.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-cadence-decoupling-nobody-decided">The cadence decoupling nobody decided<a href="https://tianpan.co/blog/2026-06-02-the-agent-rollout-cadence-your-customer-success-team-could-not-absorb#the-cadence-decoupling-nobody-decided" class="hash-link" aria-label="Direct link to The cadence decoupling nobody decided" title="Direct link to The cadence decoupling nobody decided" translate="no">​</a></h2>
<p>Most companies arrived at this by accident. The AI team adopted feature flags, canary rollouts, and continuous deployment because that is how AI products are built now — you cannot tune an agent against production traffic if you ship quarterly. The customer success team did not change. Their enablement cadence was designed around a release calendar that landed major features every four to six weeks, with training material prepared in advance and a rollout window during which CS could absorb the change.</p>
<p>Those two cadences worked when they were both slow. They also worked when the AI team was small and the CS team could just Slack the founding engineer to ask what changed. They stop working at the moment the AI team's deploys start exceeding the CS team's training throughput, which happens earlier than anyone expects — typically the week the AI team grows past four people and starts running multiple parallel experiments.</p>
<p>The decoupling is rarely a decision. It is the absence of a decision. The AI team's deploy cadence accelerated because that is what the technology allowed; the CS team's enablement cadence held because nobody asked them to change it. The mismatch is the unconscious default of two functions optimizing locally. By the time someone notices, the gap has already been costing customer trust for months.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-the-ai-team-measures-and-what-the-cs-team-measures">What the AI team measures and what the CS team measures<a href="https://tianpan.co/blog/2026-06-02-the-agent-rollout-cadence-your-customer-success-team-could-not-absorb#what-the-ai-team-measures-and-what-the-cs-team-measures" class="hash-link" aria-label="Direct link to What the AI team measures and what the CS team measures" title="Direct link to What the AI team measures and what the CS team measures" translate="no">​</a></h2>
<p>The AI team measures rollout success in eval scores, A/B test deltas, and engagement metrics. A new prompt either improves the win rate against a held-out set or it does not. A new tool either gets called more often with better outcomes or it does not. The dashboards are quantitative and the cycle time is days.</p>
<p>The CS team measures success in ticket volume, time-to-resolution, CSAT, and the rate at which their reps' macros still match reality. None of those metrics move on the AI team's dashboard. None of the AI team's metrics move on the CS team's. The two functions are running on different telemetry, looking at different surfaces, and reconciling nothing.</p>
<p>The worst version of this is when both teams are individually succeeding by their own metrics. The AI team's eval score is up two points. The CS team's tickets are up fifteen percent. Both teams report green. The intersection — that the ticket increase is downstream of the eval improvement, because the new agent behavior contradicts the documented one — is invisible to either dashboard.</p>
<p>This is the data architecture failure underneath the org failure. If neither team's metrics surface the cost the other team is bearing, the coordination problem cannot be detected by looking at telemetry. It can only be detected by talking to customers, which is a much slower feedback loop than either team is used to operating on.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-release-notes-feed-as-a-coordination-contract">The release-notes feed as a coordination contract<a href="https://tianpan.co/blog/2026-06-02-the-agent-rollout-cadence-your-customer-success-team-could-not-absorb#the-release-notes-feed-as-a-coordination-contract" class="hash-link" aria-label="Direct link to The release-notes feed as a coordination contract" title="Direct link to The release-notes feed as a coordination contract" translate="no">​</a></h2>
<p>The first concrete fix is treating behavior changes the agent will exhibit as a first-class release artifact, with a feed scoped specifically to consumers downstream of the AI team. Not a Slack ping in the AI team's channel. Not a line in a sprint review nobody outside engineering attends. A structured feed — call it a behavior changelog — that lands with enough lead time for CS to update macros, train reps, and brief frontline workers before traffic ramps to 100%.</p>
<p>The discipline is harder than it sounds because it requires the AI team to articulate, in plain prose, what behavior will change. "We updated the refund-handling prompt" does not count. "Starting Thursday, the agent will offer a partial refund on shipping for orders over $50 that are delayed more than 48 hours, instead of routing to a human" does. The second is what CS needs in order to update the rep-facing macro and the public FAQ. The first is what the AI team naturally writes and what the CS team cannot act on.</p>
<p>The translation from a prompt diff to a behavior diff is a different skill than writing the prompt. It is the same skill technical writers exercise when they turn an API changelog into a release note for SDK users. Treat it that way: a dedicated role, or at minimum a dedicated step in the release process, that lives between the engineering change and the downstream consumer.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-cs-acknowledgement-gate">The CS acknowledgement gate<a href="https://tianpan.co/blog/2026-06-02-the-agent-rollout-cadence-your-customer-success-team-could-not-absorb#the-cs-acknowledgement-gate" class="hash-link" aria-label="Direct link to The CS acknowledgement gate" title="Direct link to The CS acknowledgement gate" translate="no">​</a></h2>
<p>A feed that nobody reads is not a coordination mechanism. The second fix is a gate: behavior changes do not reach 100% of traffic until the CS team has acknowledged them and updated the corresponding artifacts.</p>
<p>This sounds like it slows the AI team down. In practice, it slows the AI team down by exactly the amount it should have been slowed down all along — the amount needed for the rest of the company to absorb the change. It also forces the AI team to size its changes against the absorption rate of its downstream consumers, which is the constraint they have been ignoring.</p>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-agents" term="ai-agents"/>
        <category label="customer-success" term="customer-success"/>
        <category label="rollout" term="rollout"/>
        <category label="engineering-leadership" term="engineering-leadership"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Agent Runbook Your Incident Commander Could Not Execute]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-agent-runbook-your-incident-commander-could-not-execute</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-agent-runbook-your-incident-commander-could-not-execute"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Most agent runbooks read fine in daylight and run blocked at 02:17 because the author has access the on-call SRE does not. Federation, declared scopes, break-glass endpoints, and drills are the fix.]]></summary>
        <content type="html"><![CDATA[<p>The page fires at 02:17 local time. The on-call SRE pulls up the agent runbook on their phone and reads step one: "check the agent's tool-call traces for anomalous tool usage." They open the link. They hit an SSO prompt for a workspace they do not belong to. Step two says inspect the prompt-construction logs; same wall. Step three says roll back to the previous prompt version, but the deploy permission is scoped to a team they are not on. By the time they figure out which Slack channel to escalate to and wake up the AI team's product manager because she is the only person they can find at 02:17, ninety minutes have passed and the customer-visible regression is still serving wrong answers.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Agent%20Runbook%20Your%20Incident%20Commander%20Could%20Not%20Execute" alt="" class="img_ev3q"></p>
<p>The post-mortem will identify the access gap as the proximate cause. The deeper discomfort is that the runbook reads fine in daylight and runs blocked at night, because the person who wrote it has access the person who executes it does not.</p>
<p>This is the failure mode that quietly waits inside almost every agent product that survived its first quarter in production. The AI team built the agent, the observability for the agent, and the deploy pipeline for the agent. They wrote the runbook against the workflow they use to debug it. The runbook is technically correct. It is operationally undeliverable to the person who actually executes it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-author-persona-is-not-the-reader-persona">The Author Persona Is Not the Reader Persona<a href="https://tianpan.co/blog/2026-06-02-the-agent-runbook-your-incident-commander-could-not-execute#the-author-persona-is-not-the-reader-persona" class="hash-link" aria-label="Direct link to The Author Persona Is Not the Reader Persona" title="Direct link to The Author Persona Is Not the Reader Persona" translate="no">​</a></h2>
<p>Every runbook has two personas, and most agent runbooks confuse them. The author persona is the engineer who built the system, knows where the traces live, has credentials for every backing service, and can describe the failure modes in the vocabulary of the codebase. The reader persona is whoever is paged at 02:17. In most organizations these are different people, and in organizations with a dedicated AI platform team they are reliably different people on reliably different on-call rotations with reliably different access.</p>
<p>Conventional service runbooks survived this gap because the service team and the SRE rotation had been negotiating it for years. There was an unspoken contract: anything in the runbook had to be executable from the access profile of the central oncall. Dashboards rendered in Grafana, not in a team-specific tool. Logs went to the central log store, not a private S3 bucket. Deploys went through the shared deploy console. When a service team forgot this contract, the SRE team noticed during the first drill, sent a stern message, and the runbook got rewritten.</p>
<p>Agent runbooks broke the contract because the AI platform team typically did not exist when the contract was negotiated. They were stood up fast, they own their own observability stack for velocity or cost reasons, and they have their own deploy pipeline because prompts are not code and code review does not catch prompt regressions. None of that is wrong. What is wrong is that the runbook they ship to oncall reads like the readme for their own debugging workflow, with no acknowledgement that the person executing it does not have their tools.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="federation-is-the-word-you-are-avoiding">Federation Is the Word You Are Avoiding<a href="https://tianpan.co/blog/2026-06-02-the-agent-runbook-your-incident-commander-could-not-execute#federation-is-the-word-you-are-avoiding" class="hash-link" aria-label="Direct link to Federation Is the Word You Are Avoiding" title="Direct link to Federation Is the Word You Are Avoiding" translate="no">​</a></h2>
<p>The cheap fix everyone tries first is to add the SRE rotation to the AI platform's tools. Grant them SSO into the prompt observability dashboard. Add them to the deploy group. Issue them credentials for the trace store. This works for one rotation, fails the next time someone joins or leaves, and creates an access surface the security team is going to ask hard questions about during the next audit. It is not federation. It is access sprawl with extra steps.</p>
<p>The right move is to push the AI platform's telemetry up into the observability surface oncall already uses. Pick a vendor-neutral instrumentation standard, OpenTelemetry being the obvious one, and emit agent traces, prompt construction logs, and tool-call decisions through it. Federate the resulting data into the central observability stack. The IC opens the same Grafana board they would for any service, sees the agent's behavior alongside everything else, and does not need a separate set of credentials to see it.</p>
<p>This is more work than handing out logins, which is exactly why teams default to handing out logins. The work pays back the first time someone joins the SRE rotation and the AI team does not get a JIRA ticket about it. It pays back the second time during an incident that touches three services and the IC does not have to switch tools between them. The federation effort is one of the few infrastructure investments where the payoff is invisible until you suddenly need it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="runbook-authoring-as-a-permission-contract">Runbook Authoring as a Permission Contract<a href="https://tianpan.co/blog/2026-06-02-the-agent-runbook-your-incident-commander-could-not-execute#runbook-authoring-as-a-permission-contract" class="hash-link" aria-label="Direct link to Runbook Authoring as a Permission Contract" title="Direct link to Runbook Authoring as a Permission Contract" translate="no">​</a></h2>
<p>Once federation exists, the runbook itself needs a discipline most teams have never imposed: each step must declare what access it requires, and a pre-merge check has to verify that the on-call rotation actually has that access.</p>
<p>This sounds bureaucratic until you have shipped one. A runbook step that reads "roll back the prompt to the previous version" is actually a permission contract: it asserts that the reader holds a deploy-rollback scope on the prompt registry. Make that assertion explicit. Tag the step with the scope it requires. At merge time, validate the tag against the membership of the on-call rotation. If the rotation does not hold the scope, the runbook does not merge until either the scope is granted, the step is rewritten to use a break-glass mechanism, or a different rotation is named as the responsible party.</p>
<p>The discipline is the same one we apply to typed function signatures. The runbook step is a function call against the IC's permission set, and an undeclared scope is the runbook equivalent of an untyped argument. It compiles, it looks fine in review, it blows up at runtime when the inputs do not match.</p>
<p>The check itself is not exotic. Most identity providers expose group membership through an API, most deploy systems publish their scope catalog, and the on-call rotation is a list in your paging tool. Wire those three together, add a CI step that fails the runbook PR when the asserted scopes are not held by the rotation, and the failure mode shifts from a 02:17 wall of authentication prompts to a Tuesday-afternoon code review comment.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-break-glass-path-the-ai-team-owes-oncall">The Break-Glass Path the AI Team Owes Oncall<a href="https://tianpan.co/blog/2026-06-02-the-agent-runbook-your-incident-commander-could-not-execute#the-break-glass-path-the-ai-team-owes-oncall" class="hash-link" aria-label="Direct link to The Break-Glass Path the AI Team Owes Oncall" title="Direct link to The Break-Glass Path the AI Team Owes Oncall" translate="no">​</a></h2>
<p>Some steps cannot be wrapped in a permission contract because the answer to "should oncall have this scope?" is no. Deploying a new prompt version requires review by people who understand the prompt. Rotating a tool's API key may need coordination with a downstream team. Granting oncall those scopes permanently is the wrong answer.</p>
<p>What you owe them is a break-glass mechanism scoped to the actions an IC will actually need to take during an incident, audited heavily after the fact. A rollback-only deploy endpoint is the canonical example. It accepts one input, the previous version's identifier, and emits a single artifact: a reverted prompt. It cannot deploy a new prompt, edit an existing one, or change tool wiring. The IC can invoke it without being on the deploy team, every invocation pages the deploy team after the fact for review, and the access surface stays small because the endpoint can only do one thing.</p>
<p>The break-glass pattern is well understood in cloud operations and well understood for AI agent rollback specifically; the failure mode is that teams treat it as an enterprise-grade feature to build later. It is an incident-survival feature to build before the first incident. The unit of rollback for an agent is not just a model version: it is a prompt package, tool contracts, policy layer, memory plane, and runtime permissions all together. The break-glass endpoint should restore a known-good bundle of those, not just flip a model pointer. Restoring half the bundle leaves the agent in a configuration no one tested.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="drills-are-not-optional-when-the-reader-is-not-the-author">Drills Are Not Optional When the Reader Is Not the Author<a href="https://tianpan.co/blog/2026-06-02-the-agent-runbook-your-incident-commander-could-not-execute#drills-are-not-optional-when-the-reader-is-not-the-author" class="hash-link" aria-label="Direct link to Drills Are Not Optional When the Reader Is Not the Author" title="Direct link to Drills Are Not Optional When the Reader Is Not the Author" translate="no">​</a></h2>
<p>Even with federation, declared scopes, and a break-glass endpoint, the runbook will rot. Permissions change. Tools get renamed. The prompt registry adds a step. The only way to keep the runbook executable is to actually execute it, with the IC, end to end, against a synthetic incident, on a cadence.</p>
<p>This is where the AI team's instinct fights them again. They will offer to "test" the runbook themselves. That is not a drill of the runbook, that is a drill of the AI team. The drill that matters is the one where the SRE who would actually be paged at 02:17 opens the runbook cold and walks it. Every step that returns an authentication prompt, every link that goes to a dashboard they cannot read, every tool name that has changed since the runbook was written, every assumption the author made about familiarity with the system, surfaces in that drill. Surface it on a Tuesday afternoon, not on a Saturday night when revenue is bleeding.</p>
<p>Mature service teams already do this; the cultural lift for AI platform teams is to accept that their system is now a multi-team operational liability and to staff the drill cadence accordingly. A reasonable starting cadence is quarterly. Quarterly drills catch most rot without becoming a burden. Once an actual incident reveals a runbook gap, that runbook moves to a monthly drill until two consecutive runs are clean.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architectural-reality-underneath-all-of-this">The Architectural Reality Underneath All Of This<a href="https://tianpan.co/blog/2026-06-02-the-agent-runbook-your-incident-commander-could-not-execute#the-architectural-reality-underneath-all-of-this" class="hash-link" aria-label="Direct link to The Architectural Reality Underneath All Of This" title="Direct link to The Architectural Reality Underneath All Of This" translate="no">​</a></h2>
<p>The realization the org has to internalize is uncomfortable for the AI platform team and obvious to anyone who has run an SRE function. An agent in production is a multi-team operational liability. The team that built it owns its design. The team that operates it owns its runbook. The team that gets paged for it owns its execution. These three teams are different teams, and the agent platform did not exist long enough for them to negotiate the interface.</p>
<p>Until they negotiate it, every runbook the AI team writes is a document that reads correct and runs blocked. Federation closes the observability gap. Declared scopes close the permission gap. Break-glass endpoints close the deploy gap. Drills close the rot gap. None of these is hard infrastructure work. All of them require admitting that the runbook author and the runbook reader are different people, with different access, on different rotations, awake at different times, and that the document only counts as written when the reader can actually run it.</p>
<p>The takeaway for any team running an agent in production: open your runbook tonight, hand it to the next SRE on rotation, and ask them to read it end to end without asking you any questions. Whatever they cannot do, that is your roadmap.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="ai-agents" term="ai-agents"/>
        <category label="sre" term="sre"/>
        <category label="incident-response" term="incident-response"/>
        <category label="runbooks" term="runbooks"/>
        <category label="observability" term="observability"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The AI Feature Your CTO Funded That Your Security Team Will Not Let You Ship]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-ai-feature-your-cto-funded-that-your-security-team-will-not-let-you-ship</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-ai-feature-your-cto-funded-that-your-security-team-will-not-let-you-ship"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[AI features that ship on time treat the security threat model as a shape constraint at the spec stage — not a checklist at the readiness gate. A guide for engineering leaders on moving security upstream.]]></summary>
        <content type="html"><![CDATA[<p>The post-mortem says "we found security too late." The actual finding is that security found you on time. Your process found security too late.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20AI%20Feature%20Your%20CTO%20Funded%20That%20Your%20Security%20Team%20Will%20Not%20Let%20You%20Ship" alt="" class="img_ev3q"></p>
<p>This is the AI feature that cleared the budget gate in January because the CTO and the CFO agreed the company needed an AI moment. It cleared a light legal review in March because it was a prototype. Engineering built against the agreed spec through Q2. In late July, the launch-readiness security review opened, and on day one the threat model came back with blockers on the auth scopes, the data-exfiltration paths, the model provider's residency story, and the prompt-injection surface. The team's quarter is now spent rebuilding to address findings that should have shaped the original spec. Two quarters of slip, an executive memo about "process improvements," and a quiet decision next planning cycle to "deprioritize AI deep-integrations."</p>
<p>The launch did not fail because security was slow. It failed because security entered after the shape of the feature had already been frozen.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-shape-constraint-nobody-drew">The Shape Constraint Nobody Drew<a href="https://tianpan.co/blog/2026-06-02-the-ai-feature-your-cto-funded-that-your-security-team-will-not-let-you-ship#the-shape-constraint-nobody-drew" class="hash-link" aria-label="Direct link to The Shape Constraint Nobody Drew" title="Direct link to The Shape Constraint Nobody Drew" translate="no">​</a></h2>
<p>An AI feature's threat model is not a checklist you stamp at the end. It is a shape constraint. It decides what the feature can be.</p>
<p>Consider a customer-support agent that can read a ticket, query an internal knowledge base, and email a draft response to the user. The PRD describes that user journey in a paragraph. The security threat model describes a different artifact: which scopes the agent's service account holds, whether a user's ticket can contain instructions the agent treats as commands, whether the knowledge base returns documents the agent should refuse to forward, whether the provider's inference endpoint can retain the customer's data, and whether the email draft goes through a human gate.</p>
<p>Those questions do not refine the PRD. They constrain it. The auth scopes determine which tools the agent can be given. The data-exfiltration model determines whether the agent can read free-text from one tenant and write into another. The provider's residency posture determines whether an EU customer can use the feature at all. The prompt-injection surface determines whether the email step can be auto-send or must be human-gated.</p>
<p>If you sign off on a PRD that says "the agent emails the draft to the user" without resolving those constraints, you have not designed a feature. You have written a wish, and the shape of what you can actually ship is going to be discovered later by someone else, against your timeline.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-lightweight-prototype-review-is-the-trap">Why "Lightweight Prototype Review" Is The Trap<a href="https://tianpan.co/blog/2026-06-02-the-ai-feature-your-cto-funded-that-your-security-team-will-not-let-you-ship#why-lightweight-prototype-review-is-the-trap" class="hash-link" aria-label="Direct link to Why &quot;Lightweight Prototype Review&quot; Is The Trap" title="Direct link to Why &quot;Lightweight Prototype Review&quot; Is The Trap" translate="no">​</a></h2>
<p>Non-AI features pass through a lightweight architecture review at the design-doc stage. A new payments path gets a security architect's eyes on it. A new auth integration gets a thirty-minute threat model. A new admin endpoint gets a quick read for IDOR risk. Nobody complains because everyone has internalized that you cannot retrofit authn/authz after the fact without paying a tax.</p>
<p>AI features are skipping that step. The reason is cultural: AI features feel research-shaped. They emerge from a notebook, a prompt, a Loom demo. They look like they have not been "built" yet, so applying the normal architecture review feels premature. Product treats the early build as a prototype. Engineering treats it as a probe of feasibility. Security is not invited because there is supposedly nothing to review.</p>
<p>By the time there is something to review, the auth model has been decided implicitly by what the prototype's service account happened to have access to. The data flow has been decided implicitly by which APIs were easiest to wire up. The provider has been decided implicitly by which SDK the prototyper imported first. The thing that looks like a prototype is already a frozen architecture. It just has not been labeled that way.</p>
<p>The fix is not to slow the prototype down. It is to recognize that an AI prototype's first commit is a higher-stakes design decision than a non-AI feature's design doc, and to apply the same lightweight architecture review accordingly.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-an-ai-threat-model-actually-catches">What An AI Threat Model Actually Catches<a href="https://tianpan.co/blog/2026-06-02-the-ai-feature-your-cto-funded-that-your-security-team-will-not-let-you-ship#what-an-ai-threat-model-actually-catches" class="hash-link" aria-label="Direct link to What An AI Threat Model Actually Catches" title="Direct link to What An AI Threat Model Actually Catches" translate="no">​</a></h2>
<p>The threat surface for an AI agent looks different from a traditional application. It is dynamic, context-dependent, and capable of taking consequential actions in real time, often through tools the security team has not previously catalogued as code paths.</p>
<p>A useful AI threat model catches at least five things that a generic API review misses:</p>
<ul>
<li class=""><strong>Prompt-injection surface.</strong> Every untrusted text the model reads is a potential instruction. A support ticket, a web page the agent fetches, an attachment, a tool result — any of these can carry hidden instructions that re-aim the agent. Microsoft's security team documented prompt injection paths in agent frameworks that escalate from "the model said something weird" to host-level code execution, because once a model is wired to tools, prompt injection is a code-execution primitive, not a content problem.</li>
<li class=""><strong>Tool-scope blast radius.</strong> The agent's service account is the agent's authority. If the agent has a <code>send_email</code> tool and a <code>read_ticket</code> tool, prompt injection through a ticket can trigger an email. The blast radius is not "the model behaved badly" but "the agent took a real action through a tool the team gave it." Threat models force a tool-by-tool capability audit before the tool list is committed.</li>
<li class=""><strong>Cross-tenant context leakage.</strong> Many AI features share a vector index, a prompt-cache, or a fine-tuned adapter across tenants. The retrieval step then becomes the cross-tenant boundary, and if the retrieval keys are wrong by one column, tenant A's documents end up in tenant B's prompt. This is not an exotic failure mode; it is the default for teams that build retrieval before they design isolation.</li>
<li class=""><strong>Provider data path and residency.</strong> If the LLM is processing customer data, the LLM provider is a subprocessor. That triggers a DPA amendment, Standard Contractual Clauses for EU data, an opt-out from training/retention, and possibly a choice of model with GDPR-compatible terms. A SaaS application makes residency decisions at provisioning; an AI agent makes them at inference, which means the residency posture cannot be deferred to the deploy step.</li>
<li class=""><strong>Indirect prompt injection from third-party content.</strong> Indirect prompt injection — instructions hidden in a web page or document the agent reads — was theoretical two years ago and is now observed in the wild against production agents. If your feature reads any content the user did not author, that content is an attack surface, and the threat model has to call out which fetches need an isolation step.</li>
</ul>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-security" term="ai-security"/>
        <category label="threat-modeling" term="threat-modeling"/>
        <category label="engineering-leadership" term="engineering-leadership"/>
        <category label="ai-agents" term="ai-agents"/>
        <category label="product-development" term="product-development"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Annotation Queue Your Humans Quietly Stopped Reading]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-annotation-queue-your-humans-quietly-stopped-reading</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-annotation-queue-your-humans-quietly-stopped-reading"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Annotator throughput is the silent ceiling on every LLM eval program, and the queue ordering is the sampler nobody designed. How to treat sampling-for-grading as a first-class engineering surface.]]></summary>
        <content type="html"><![CDATA[<p>Your eval pipeline emits 800 traces per week for human review. Your annotators have about ninety minutes a week budgeted for it. They open the queue, grade the first three, mark a few more as "skip," and close the tab. The leaderboard you stare at on Monday morning is now a survey of which traces happened to land near the top of the list, not a measurement of system quality.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Annotation%20Queue%20Your%20Humans%20Quietly%20Stopped%20Reading" alt="" class="img_ev3q"></p>
<p>This is not a labeling problem. It is a throughput problem dressed up as a quality problem, and it is one of the quietest ways an evaluation program degrades. The traces still flow. The dashboards still render. The number still moves. What you do not see is that the denominator of your "human-graded eval score" silently shrank to a handful of items chosen by an ordering function nobody designed on purpose.</p>
<p>The pattern is familiar to anyone who has run an on-call rotation past the point of sustainability. The pager keeps firing. Engineers keep clicking acknowledge. The incidents that get a real postmortem are the ones that happened to land during business hours on a quiet day. Everything else gets a one-line note and a green checkbox. The system looks healthy by every metric the team owns. The metric is the failure.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="annotator-throughput-is-the-ceiling-and-it-does-not-scale-with-traffic">Annotator throughput is the ceiling, and it does not scale with traffic<a href="https://tianpan.co/blog/2026-06-02-the-annotation-queue-your-humans-quietly-stopped-reading#annotator-throughput-is-the-ceiling-and-it-does-not-scale-with-traffic" class="hash-link" aria-label="Direct link to Annotator throughput is the ceiling, and it does not scale with traffic" title="Direct link to Annotator throughput is the ceiling, and it does not scale with traffic" translate="no">​</a></h2>
<p>The first thing to internalize is that human grading capacity is a hard, slow-moving ceiling. A trained reviewer can carefully grade somewhere between thirty and a hundred traces per hour, depending on task complexity and how much context they need to reload between cases. A part-time pool of three reviewers, given four hours a week each, tops out around twelve hundred careful grades a week. A serious eval pipeline can produce that many traces in an afternoon.</p>
<p>This asymmetry is not a recruiting problem you can solve with another job posting. Domain-expert annotators are precisely the people whose calendars are already saturated, because they are the same engineers, lawyers, clinicians, and support leads whose judgment the product depends on in the first place. Hiring generic annotators does not help when the question is whether a specific tool call was the right move in a specific customer's account.</p>
<p>So the budget is fixed. Production traffic, by contrast, is exponential. Once your application crosses a few thousand requests a day, the gap between traces emitted and traces a human will actually read passes three orders of magnitude. From that point on, every additional unit of traffic widens the gap. Adding more reviewers buys you a constant factor. The eval pipeline keeps growing geometrically. The ratio gets worse, not better, as your product succeeds.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-queue-order-is-the-silent-sampler-nobody-chose">The queue order is the silent sampler nobody chose<a href="https://tianpan.co/blog/2026-06-02-the-annotation-queue-your-humans-quietly-stopped-reading#the-queue-order-is-the-silent-sampler-nobody-chose" class="hash-link" aria-label="Direct link to The queue order is the silent sampler nobody chose" title="Direct link to The queue order is the silent sampler nobody chose" translate="no">​</a></h2>
<p>When you produce far more traces than humans can grade, the difference between the traces you produce and the traces you grade becomes a sample. Every sample has a sampling function. If you did not design one, the default is whatever ordering your queue happens to use: timestamp, insertion order, trace ID hash, or whichever join key the dashboard query happened to land on.</p>
<p>That default ordering is almost never the right answer, and it is rarely random. Recency bias means the most recent traces dominate. Reviewers fatigue partway down and the tail of the queue is systematically under-graded. If your queue is sorted by anything correlated with the input — user ID, request size, latency — your "human-graded score" is a measurement of the slice that ordering surfaced, and the slice that ordering hid is invisible.</p>
<p>The dangerous part is that this looks like signal. The number is stable week over week because the sampling bias is stable. It feels like a real measurement because it is consistent. It moves when the system changes because some changes do show up in the graded slice. None of that means it generalizes. You can spend a quarter chasing regressions on a subset of traffic the team did not know existed, and a quarter shipping fixes whose impact is invisible because they help the slice no one is reading.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-four-sampling-strategies-and-what-each-one-is-for">The four sampling strategies, and what each one is for<a href="https://tianpan.co/blog/2026-06-02-the-annotation-queue-your-humans-quietly-stopped-reading#the-four-sampling-strategies-and-what-each-one-is-for" class="hash-link" aria-label="Direct link to The four sampling strategies, and what each one is for" title="Direct link to The four sampling strategies, and what each one is for" translate="no">​</a></h2>
<p>Annotation-queue tooling has converged on a handful of sampling strategies. They are not interchangeable. Each one answers a different question and produces a different bias.</p>
<p><strong>Random sampling</strong> gives you an unbiased view of overall quality. If you want to track whether the system is getting better or worse in aggregate, this is the only sampling regime that gives you that answer without correction. The cost is that random sampling spends almost all its budget on the median case and almost none on the tails, which is exactly where regressions live.</p>
<p><strong>Stratified sampling</strong> divides traffic into segments — user tier, feature, request type, conversation length, language — and samples within each. This is what you want when the system behaves differently for different populations and you need to detect a regression in one segment without it being washed out by ten others. Stratification turns a single aggregate number into a panel of segment-level numbers, each of which is honest about its own population.</p>
<p><strong>Priority sampling</strong> pushes the rare or interesting traces to the front: low automated scores, high latency, user-reported issues, traces where two judges disagreed. This is where your budget pays off for finding new failure modes. The cost is that priority-sampled grades are not representative of overall quality — they are representative of "things the priority function flagged," which is a different quantity.</p>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="evals" term="evals"/>
        <category label="annotation" term="annotation"/>
        <category label="observability" term="observability"/>
        <category label="human-in-the-loop" term="human-in-the-loop"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Are-You-Sure Confirmation Step Your Users Learned to Click Through]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-are-you-sure-confirmation-step-your-users-learned-to-click-through</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-are-you-sure-confirmation-step-your-users-learned-to-click-through"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Uniform confirmation prompts in AI agents create habituation: users click through high-stakes actions with the same reflex as low-stakes ones. A stakes-aware friction budget, artifact previews, and instrumented time-to-click rebuild the safety layer.]]></summary>
        <content type="html"><![CDATA[<p>The confirmation dialog is the cheapest safety layer in the AI agent toolbox. It's a string, a button, and a callback. The product manager who asked for it left the meeting believing the agent was now safe. The engineer who built it shipped it in an afternoon. The compliance reviewer who audited it ticked the box. And the user who saw it for the seventh time that morning had already moved their mouse to the Confirm button before their eyes finished reading the title.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Are-You-Sure%20Confirmation%20Step%20Your%20Users%20Learned%20to%20Click%20Through" alt="" class="img_ev3q"></p>
<p>Within a week, the confirmation step is no longer a decision point. It's a rhythm. The agent says "are you sure you want to send this email?" and the user says yes the way they say bless-you at a sneeze. The day the agent proposes an action that is actually wrong — wrong recipient, wrong amount, wrong tone — the user confirms it with the same automaticity they used for the six correct ones before it, and the email goes out, and the team writes a postmortem that says "user error."</p>
<p>It wasn't user error. It was a system that mistook the existence of a click for the existence of consent.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="habituation-is-a-feature-of-the-user-not-a-bug-in-the-dialog">Habituation is a feature of the user, not a bug in the dialog<a href="https://tianpan.co/blog/2026-06-02-the-are-you-sure-confirmation-step-your-users-learned-to-click-through#habituation-is-a-feature-of-the-user-not-a-bug-in-the-dialog" class="hash-link" aria-label="Direct link to Habituation is a feature of the user, not a bug in the dialog" title="Direct link to Habituation is a feature of the user, not a bug in the dialog" translate="no">​</a></h2>
<p>The first thing to understand about confirmation prompts is that habituation is not a failure of attention. It's a successful adaptation. The user's brain encountered a stimulus that fired in identical form every time, with no information that predicted whether the action was risky or routine, and learned — correctly, by every reasonable inference rule — that the stimulus carried no signal worth processing.</p>
<p>A 2015 fMRI study at Brigham Young, University of Pittsburgh, and Google measured the visual cortex response to repeated security warnings and watched it collapse after the second exposure. By the fifth, the warning was, neurologically, not being seen. A separate 2014 SOUPS study found participants clicking through SSL warnings in under two seconds — fast enough that the click was a motor program, not a cognitive event. Another study reported that only 14% of users noticed when the text of a confirmation dialog was changed mid-experiment.</p>
<p>These numbers are not about lazy users. They are about a stimulus that gave the user no reason to keep paying attention, and a user whose attention reasonably went elsewhere. The dialog was designed to feel safe to the engineer who shipped it. It was not designed to remain salient to the user who lived with it.</p>
<p>When an AI agent fires the same generic confirmation before every tool call — "do you want to proceed?" — it is running this same habituation loop, but compressed. Agents act faster and more often than humans do. A user who saw a confirmation dialog from their old SaaS app twice a week now sees them from their agent twenty times a day. The half-life of attention to those dialogs is measured in hours, not months.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-friction-budget-not-a-friction-floor">A friction budget, not a friction floor<a href="https://tianpan.co/blog/2026-06-02-the-are-you-sure-confirmation-step-your-users-learned-to-click-through#a-friction-budget-not-a-friction-floor" class="hash-link" aria-label="Direct link to A friction budget, not a friction floor" title="Direct link to A friction budget, not a friction floor" translate="no">​</a></h2>
<p>The mental model that fails is treating confirmation as a uniform safety net you can drape over every irreversible action. The mental model that works is treating confirmation as a friction budget — a finite resource you can spend on a small set of decisions per session before the user stops processing any of them.</p>
<p>A friction budget has the property that spending it everywhere is the same as spending it nowhere. If every action gets a prompt, every prompt gets the same reflexive click. The high-stakes prompt that matters has been camouflaged by the dozens of low-stakes prompts that don't. The agent that tries to "be safe" by confirming everything has actually flattened the user's signal-to-noise ratio to zero, and is now operating with no confirmation at all — just the appearance of one.</p>
<p>The first move, before any UX work, is to build a stakes classifier on the action itself. Not on the action category — "email" is not a stake; "email to legal counsel quoting an unsigned contract" is a stake. Not on a static list — anything with a static list will be wrong on the action you didn't anticipate. A stakes classifier that can be evaluated per-call, scoring on dimensions like reversibility (can the user undo this in one click, or does it require a phone call?), blast radius (does this affect one record, one team, or one customer base?), and externality (does the result leave the user's control — sent, posted, paid?).</p>
<p>Actions below the threshold get no confirmation at all. The agent just does them, and the audit log records what happened. Actions above the threshold get confirmation that's worth the user's attention precisely because they don't see it often. The user who sees three confirmations a day, each one tied to something they would have wanted to think about, is a user whose confirmations still mean something.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-prompt-has-to-surface-the-irreversibility-not-just-announce-it">The prompt has to surface the irreversibility, not just announce it<a href="https://tianpan.co/blog/2026-06-02-the-are-you-sure-confirmation-step-your-users-learned-to-click-through#the-prompt-has-to-surface-the-irreversibility-not-just-announce-it" class="hash-link" aria-label="Direct link to The prompt has to surface the irreversibility, not just announce it" title="Direct link to The prompt has to surface the irreversibility, not just announce it" translate="no">​</a></h2>
<p>The second move is to stop asking "are you sure" and start showing what the user is committing to. A generic "are you sure you want to send this email?" is a question whose answer is always yes — because the user just told the agent to send it, and the agent is now asking the user to re-affirm the thing they already affirmed. The dialog is asking the user to ratify an intent, not to review an artifact.</p>
<p>A confirmation that earns its friction shows the artifact. The email itself, with the recipient, the subject, the body, and any attachments, rendered in the form the recipient will see it. The transfer, with the source account, the destination, the amount, and the resulting balance. The deletion, with the count of records, a sample of what's in them, and a statement of what depends on them. The user can now exercise judgment because the system has handed them the material for judgment. The preview is the confirmation; the button is just the trigger.</p>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-agents" term="ai-agents"/>
        <category label="ux" term="ux"/>
        <category label="safety" term="safety"/>
        <category label="human-in-the-loop" term="human-in-the-loop"/>
        <category label="design-patterns" term="design-patterns"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Async Tool Call Your Agent Fired and Forgot]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-async-tool-call-your-agent-fired-and-forgot</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-async-tool-call-your-agent-fired-and-forgot"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Function calling treats sync and async tools as the same shape. The agent fires a job, receives an ID, marks the step done — and the work never lands.]]></summary>
        <content type="html"><![CDATA[<p>The clearest sign that an agent's tool-call abstraction is broken is when the trace shows the step marked done and the downstream system shows nothing happened. The model called a tool, received a job ID back, treated the job ID as the answer, and moved on. Three minutes later the actual work either succeeded with nobody listening or failed with the error landing in a log nobody reads. The user sees a confident summary; the operations queue sees a stranded task.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Async%20Tool%20Call%20Your%20Agent%20Fired%20and%20Forgot" alt="" class="img_ev3q"></p>
<p>This is the failure mode the function-calling abstraction quietly enables. JSON schemas describe parameters and return types, but they do not distinguish between "this tool returns a result" and "this tool returns a receipt for an operation whose result you will need to ask about later." The model treats both the same way, because to the planner they look the same — a successful tool call with a non-error payload.</p>
<p>The benchmark numbers are bleak. On Robotouille, an asynchronous planning benchmark that measures whether agents can interleave actions with operations that take real time, ReAct on GPT-4o scores 47% on synchronous variants and 11% on asynchronous ones. The architecture isn't subtly worse at async — it falls apart, because every async tool call is an opportunity for the planner to mistake acknowledgment for completion.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-type-system-has-a-hole-in-it">The Type System Has a Hole in It<a href="https://tianpan.co/blog/2026-06-02-the-async-tool-call-your-agent-fired-and-forgot#the-type-system-has-a-hole-in-it" class="hash-link" aria-label="Direct link to The Type System Has a Hole in It" title="Direct link to The Type System Has a Hole in It" translate="no">​</a></h2>
<p>A function call schema names parameters, return shape, and a one-line description. What it does not name is the temporal contract: does this call return a result, or a promise of a result?</p>
<p>A <code>send_email</code> tool that returns <code>{"status": "sent"}</code> looks identical, at the JSON level, to a <code>start_video_render</code> tool that returns <code>{"job_id": "abc123"}</code>. Both produce a payload. Both come back with no error. The planner has no type-level signal that one of them is done and the other is barely started. Tool authors write descriptions like "starts a render job and returns the job ID" — but that prose is one sentence in a system prompt where the model is juggling dozens of tools, and at runtime the model collapses both calls into the same mental category: "tool succeeded, advance the plan."</p>
<p>The new MCP specification (2025-11-25 revision) acknowledges this gap by adding Tasks as a separate primitive — a durable state machine with explicit states like <code>working</code>, <code>input_required</code>, <code>completed</code>, <code>failed</code>, and <code>cancelled</code>. The point is not the state names. The point is that async work has a <em>kind</em> that synchronous work doesn't, and putting it on a different runtime path stops the planner from confusing the two. Bedrock AgentCore's runtime makes the same separation with explicit <code>add_async_task</code> and <code>complete_async_task</code> calls that the SDK uses to track tasks and manage status pings independently of the model's reasoning loop.</p>
<p>If your function-calling layer treats every tool as a synchronous returns-the-answer call, you have one type for two phenomena. The first time you ship a long-running tool, you have shipped a planner that lies about completion.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-polling-loop-whose-budget-was-set-without-looking-at-the-work">The Polling Loop Whose Budget Was Set Without Looking at the Work<a href="https://tianpan.co/blog/2026-06-02-the-async-tool-call-your-agent-fired-and-forgot#the-polling-loop-whose-budget-was-set-without-looking-at-the-work" class="hash-link" aria-label="Direct link to The Polling Loop Whose Budget Was Set Without Looking at the Work" title="Direct link to The Polling Loop Whose Budget Was Set Without Looking at the Work" translate="no">​</a></h2>
<p>Teams that recognize the async case usually patch it by giving the agent a <code>check_status</code> tool and trusting the planner to call it until the job is done. This works in demos and falls apart in production for one specific reason: the agent's outer loop budget — max turns, max tokens, max wall-clock — is set by people thinking about cost, not by people thinking about how long real operations take.</p>
<p>A typical loop budget is 20 to 30 turns and a few hundred seconds of total execution time. A typical long-running tool is a video render, a large file transcription, a multi-step provisioning job, an ETL pipeline. The operation's typical duration is several minutes. The agent's poll-and-wait loop gives up after roughly ninety seconds because that's all the turn budget allows.</p>
<p>What does the agent report when the loop budget exhausts before the job completes? Almost always: the agent synthesizes a result. Phantom status reports — the model references a job status from an earlier <code>check_status</code> call instead of making a fresh one. Premature collection — the model tries to assemble a final answer from "the job is queued" because that's the most recent observation it has. ID truncation — under context pressure, the model abbreviates or reformats the job ID, and the next <code>check_status</code> call fails because the lookup string is mangled.</p>
<p>The user sees a coherent answer. The operation may still be running. The agent has produced a result indistinguishable from the case where it actually waited. This is worse than a timeout error, because at least a timeout is honest.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="sync-and-async-tools-are-different-abstractions-wearing-the-same-json-schema">Sync and Async Tools Are Different Abstractions Wearing the Same JSON Schema<a href="https://tianpan.co/blog/2026-06-02-the-async-tool-call-your-agent-fired-and-forgot#sync-and-async-tools-are-different-abstractions-wearing-the-same-json-schema" class="hash-link" aria-label="Direct link to Sync and Async Tools Are Different Abstractions Wearing the Same JSON Schema" title="Direct link to Sync and Async Tools Are Different Abstractions Wearing the Same JSON Schema" translate="no">​</a></h2>
<p>The trap is that the wire format lets you pretend they're the same. The two abstractions actually differ on every axis that matters:</p>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-agents" term="ai-agents"/>
        <category label="tool-calling" term="tool-calling"/>
        <category label="mcp" term="mcp"/>
        <category label="async" term="async"/>
        <category label="distributed-systems" term="distributed-systems"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Budget Cap That Fires After the Action Already Shipped]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-budget-cap-that-fires-after-the-action-already-shipped</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-budget-cap-that-fires-after-the-action-already-shipped"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[When the kill-switch fires correctly but the agent has already booked the flight, sent the email, and closed the ticket — why budget caps measured in tokens miss the damage measured in actions, and how to separate spend from irreversibility.]]></summary>
        <content type="html"><![CDATA[<p>A single power user burns through your monthly token budget by 9am on day three. The kill-switch fires correctly — the gateway returns 429, the model calls stop, the bill flatlines. Meanwhile the agent has already booked the flight, sent the email confirmation, and closed the support ticket as resolved. The dashboard says "spend halted." The user says "why did you charge me for a trip I never asked for." Both are right. The budget cap stopped the model from thinking. It did not stop the world from changing.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Budget%20Cap%20That%20Fires%20After%20the%20Action%20Already%20Shipped" alt="" class="img_ev3q"></p>
<p>This is the failure mode that almost every agent budget guardrail ships with: the cap is a signal in the <em>spend</em> plane, but the damage lives in the <em>action</em> plane, and the two planes were wired up with no shared transaction boundary. Telling the model to stop is not the same as telling the world to undo what the model just did.</p>
<p>The pattern is so consistent across teams that you can predict the postmortem before reading it. The runaway is detected. The kill-switch is praised for firing fast. The customer support queue then fills with refund requests for actions that the kill-switch was technically powerless to prevent, because they happened in the half-second between the last tool call and the cap evaluation. "We stopped spending" gets shipped as the win. "We stopped acting" was never actually in scope.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-two-planes-were-never-the-same-loop">The Two Planes Were Never the Same Loop<a href="https://tianpan.co/blog/2026-06-02-the-budget-cap-that-fires-after-the-action-already-shipped#the-two-planes-were-never-the-same-loop" class="hash-link" aria-label="Direct link to The Two Planes Were Never the Same Loop" title="Direct link to The Two Planes Were Never the Same Loop" translate="no">​</a></h2>
<p>Most agent runtimes evolved their cost controls and their tool-execution path independently. Cost controls live at the gateway: token counters, per-user quotas, budget envelopes, throttling. Tool execution lives at the agent runtime: function calls, HTTP requests, database writes, third-party API calls. The gateway sees model traffic. The runtime sees side effects. Neither sees both, and neither owns the question "is this next action recoverable if we stop now."</p>
<p>The result is a check ordering that reads correctly on paper and fails in practice. The flow looks like: model call → response with tool call → gateway counts tokens → tool executes → next model call → gateway checks budget → kill. The cap fires <em>between</em> model calls. It does not fire between <em>tool calls</em>. So a chain that emits one heavy reasoning response followed by three quick tool invocations will execute all three tool calls before the budget check has a chance to weigh in.</p>
<p>Worse, the tool calls in that chain are often the irreversible ones. The model thinks expensively, decides what to do, then issues cheap actions — <code>send_email</code>, <code>confirm_booking</code>, <code>post_message</code>, <code>close_ticket</code>. The budget signal arrives after the cheap actions, because the cheap actions are exactly the ones the budget did not flag as worth pausing for. Cost-as-control breaks because cost was never a good proxy for impact.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-is-not-impact">Cost Is Not Impact<a href="https://tianpan.co/blog/2026-06-02-the-budget-cap-that-fires-after-the-action-already-shipped#cost-is-not-impact" class="hash-link" aria-label="Direct link to Cost Is Not Impact" title="Direct link to Cost Is Not Impact" translate="no">​</a></h2>
<p>Treating tokens as the universal currency of agent risk is the deeper error. A 100-token call to <code>delete_customer_record</code> is a regulatory incident. A 50,000-token reasoning trace that resolves a question without touching any tool is free of external consequence. Budgeting by token, then, conflates two things that have almost no causal relationship — how much the model thought, and how much the world will remember the model thinking.</p>
<p>Once you see the gap, the architectural move is to separate the budget into two distinct accounts: a token account and an action account. The token account is what gateways already measure. The action account is what nobody is measuring, but should be: a per-action "blast radius" weight that captures whether the side effect is internal or external, reversible or irreversible, idempotent or one-shot, soft-state or hard-commit.</p>
<p>A pragmatic taxonomy that works in production:</p>
<ul>
<li class=""><strong>Free</strong>: pure model inference with no tool calls, sandboxed reads, dry-run queries that the tool itself flags as preview.</li>
<li class=""><strong>Cheap</strong>: writes to internal stores that the agent owns and can undo, writes that are idempotent with a known compensating action, sends to staging or test channels.</li>
<li class=""><strong>Expensive</strong>: writes to systems of record, customer-visible communications, money movement, third-party API calls that mutate external state.</li>
<li class=""><strong>One-way</strong>: actions that cannot be compensated even in principle — a deleted record without a tombstone, an email that has been read, a payment captured to a card that is now closed, a real-world event like a printed shipping label or a dispatched courier.</li>
</ul>
<p>The budget cap, once you have this account, no longer fires against tokens alone. It fires against the projected one-way action count for the rest of the session. If the user is at 80% of their token budget and the agent's next planned step is <code>send_external_email</code>, the cap should reject the step regardless of token cost. If the user is at 99% of their token budget and the next step is a sandboxed search, the cap should let it through.</p>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-agents" term="ai-agents"/>
        <category label="guardrails" term="guardrails"/>
        <category label="observability" term="observability"/>
        <category label="reliability" term="reliability"/>
        <category label="cost" term="cost"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Bug Report Against a Model Version You No Longer Serve]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-bug-report-against-a-model-version-you-no-longer-serve</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-bug-report-against-a-model-version-you-no-longer-serve"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A customer's bug report against weights you rotated last month is the moment your model versioning policy stops being internal MLOps and starts being a customer-visible contract.]]></summary>
        <content type="html"><![CDATA[<p>A customer support ticket arrives on a Tuesday. The customer attached a screenshot of an output your product generated six weeks ago. They say it is wrong, or unsafe, or simply not what they expected, and they want it fixed. Your support engineer pastes the prompt back into the same API endpoint and gets a clean, reasonable answer. The bug, as far as the system can tell, does not exist.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Bug%20Report%20Against%20a%20Model%20Version%20You%20No%20Longer%20Serve" alt="" class="img_ev3q"></p>
<p>The bug exists. The model that produced the screenshot does not. Since the customer filed the ticket, the weights behind your <code>v1-chat</code> endpoint have been swapped twice — once for a quality bump, once for a cost optimization — and the original checkpoint is no longer reachable. The customer's "this is broken" is now an unfalsifiable claim against a moving target, and the support team has no path to either confirm it or close it out.</p>
<p>This is not a quirky edge case. It is the predictable consequence of treating model versioning as an internal MLOps concern when it is actually a customer-visible product contract. The endpoint URL is stable. The artifact behind it is not. Until your support workflow, your retention policy, and your customer contract acknowledge that gap, every bug report against a rotated checkpoint will land in the same triage void.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-endpoint-is-not-the-model">The Endpoint Is Not the Model<a href="https://tianpan.co/blog/2026-06-02-the-bug-report-against-a-model-version-you-no-longer-serve#the-endpoint-is-not-the-model" class="hash-link" aria-label="Direct link to The Endpoint Is Not the Model" title="Direct link to The Endpoint Is Not the Model" translate="no">​</a></h2>
<p>The mental model your team built around the endpoint is wrong, and it is wrong in a way that has been working for you up until now. You named the route <code>v1-chat</code> and you wrote in the changelog that "the v1 contract is the schema, not the model." That sentence is technically defensible — the request shape, the response shape, the auth headers, the rate limits, none of those have changed. The model, on the other hand, has been continuously upgraded behind the same URL because that is how the team chose to interpret "v1."</p>
<p>The customer interpreted "v1" differently. To them, the endpoint is a black box that produced a specific output on a specific day, and the name on the door implied that whatever was inside the box would keep behaving the same way. The frontier providers have learned this lesson the hard way and now expose both pinned snapshot IDs and floating aliases — Claude's <code>claude-sonnet-4-5-20250929</code> versus <code>claude-sonnet-4-5</code>, OpenAI's dated suffixes versus the bare family names. The pinned snapshot is a promise that the weights and the configuration behind that ID will not change for the lifetime of the ID. The floating alias is a convenience that explicitly forfeits that promise.</p>
<p>If you only offer the floating alias, you have shipped a contract where the substance is mutable. The customer cannot pin even if they want to. The fact that you have engineering reasons for rotating the checkpoint — newer weights are cheaper, safer, better — does not change what the customer signed for, which was the behavior they saw during the trial. The endpoint name and the served artifact need to be two distinct things, and the customer needs the ability to name the latter when something goes wrong with it.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-triage-path-that-doesnt-exist">The Triage Path That Doesn't Exist<a href="https://tianpan.co/blog/2026-06-02-the-bug-report-against-a-model-version-you-no-longer-serve#the-triage-path-that-doesnt-exist" class="hash-link" aria-label="Direct link to The Triage Path That Doesn't Exist" title="Direct link to The Triage Path That Doesn't Exist" translate="no">​</a></h2>
<p>Once the artifact is mutable, the entire support workflow needs to absorb that. Most don't. The default support engineer's playbook assumes a deterministic, reproducible system: take the inputs from the bug report, re-run them against the current system, observe whether the bug is still present, escalate if yes, close if no. That playbook quietly breaks the moment "the current system" is not the same system the bug was filed against.</p>
<p>The failure mode goes like this. The customer's screenshot was generated against checkpoint A. The support engineer re-runs it against checkpoint C, which is what's behind the endpoint today. The output is different — sometimes better, sometimes just different in ways that do not exhibit the original problem. The engineer closes the ticket as "cannot reproduce." The customer either reopens it with more screenshots of the same checkpoint-A behavior they cannot regenerate, or quietly loses faith in the product. Neither outcome is a fix.</p>
<p>What the workflow is missing is a triage path for bugs filed against retired artifacts. It needs three things the current ticket pipeline almost certainly does not have: the model version that produced the original output captured in the ticket itself, a way to route a non-reproducible bug to the team that retired the checkpoint rather than dead-ending it at support, and a policy decision — made once, written down — about what the company will tell a customer when the bug is real but the artifact is gone. "We changed the model and we cannot reproduce your issue" is a perfectly honest answer. The problem is that no one has ever said it out loud, so the support engineer makes it up on the fly and the customer hears improvisation.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-customers-eval-was-a-procurement-decision">The Customer's Eval Was a Procurement Decision<a href="https://tianpan.co/blog/2026-06-02-the-bug-report-against-a-model-version-you-no-longer-serve#the-customers-eval-was-a-procurement-decision" class="hash-link" aria-label="Direct link to The Customer's Eval Was a Procurement Decision" title="Direct link to The Customer's Eval Was a Procurement Decision" translate="no">​</a></h2>
<p>This part is the one that turns a support inconvenience into a contractual problem. During the trial, the customer ran your endpoint against their own eval set — the one their compliance team approved, the one the procurement committee referenced when they signed off on the contract. The numbers from that eval are now sitting in a deck, in a memo, in a security questionnaire response, in an internal justification document. Those numbers were generated against checkpoint A.</p>
<p>When you silently rotated to checkpoint C, you invalidated the empirical basis for the procurement decision. The customer probably did not notice, because the rotation was silent. But they should have noticed, because the numbers no longer hold. Re-run the same eval today and the scores will differ — perhaps better, perhaps worse, perhaps better on average but worse on the specific subset of cases that mattered to whichever stakeholder pushed the deal across the line. Either way, the document that justified buying your product no longer describes the product they have.</p>
<p>The legally interesting failure happens at renewal time, or at the first audit, when the customer's compliance team pulls the original eval report and asks the AI team to refresh it. The refresh shows different numbers. Now someone has to explain, in writing, why the product the company is paying for behaves differently from the product the company evaluated. The answer — "the vendor upgraded the model" — is fine if you told them and they consented. It is not fine if you didn't, and "we treat the model as an implementation detail behind the endpoint" is not the comfort to a regulated buyer that it is to your platform team.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="model-versioning-as-a-first-class-contract">Model Versioning as a First-Class Contract<a href="https://tianpan.co/blog/2026-06-02-the-bug-report-against-a-model-version-you-no-longer-serve#model-versioning-as-a-first-class-contract" class="hash-link" aria-label="Direct link to Model Versioning as a First-Class Contract" title="Direct link to Model Versioning as a First-Class Contract" translate="no">​</a></h2>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-engineering" term="ai-engineering"/>
        <category label="mlops" term="mlops"/>
        <category label="model-versioning" term="model-versioning"/>
        <category label="llm-ops" term="llm-ops"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The CDN Edge Cache Your AI Feature Could Not Use Because the Response Varies Per User]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-cdn-edge-cache-your-ai-feature-could-not-use</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-cdn-edge-cache-your-ai-feature-could-not-use"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Personalized AI features inherit a different physics from the cached web. The latency SLO your team borrowed from CDN-backed surfaces is structurally unmeetable for per-user generated responses — and what to do about it.]]></summary>
        <content type="html"><![CDATA[<p>The product team set the SLO for the new AI summarizer at 200ms TTFB because that is what the rest of the product hits at p50. Nobody on the call asked where the 200ms came from. It came from a decade of static assets and JSON responses served out of a CDN edge cache with an 85% hit rate, where most requests never reached origin and the ones that did were small. The summarizer is per-user, generated fresh each call, and travels edge → origin → model provider on every request. The SLO was structurally unmeetable on day one. The team discovered this in week six, after the dashboard had been red the whole time.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20CDN%20Edge%20Cache%20Your%20AI%20Feature%20Could%20Not%20Use%20Because%20the%20Response%20Varies%20Per%20User" alt="" class="img_ev3q"></p>
<p>This is a recurring pattern in AI feature launches. The latency bar an organization built on top of one set of physics gets inherited by a feature with completely different physics, and the gap between the inherited target and the achievable floor becomes a months-long mitigation project instead of a Day 0 design constraint. The numbers do not care that the SLO was negotiated with a customer in good faith.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-hidden-gift-of-a-decade-of-static-assets">The Hidden Gift of a Decade of Static Assets<a href="https://tianpan.co/blog/2026-06-02-the-cdn-edge-cache-your-ai-feature-could-not-use#the-hidden-gift-of-a-decade-of-static-assets" class="hash-link" aria-label="Direct link to The Hidden Gift of a Decade of Static Assets" title="Direct link to The Hidden Gift of a Decade of Static Assets" translate="no">​</a></h2>
<p>The cache hit rates that quiet product engineers have grown up with are not a property of the application — they are a property of the workload. Static assets are byte-identical across users. API responses for the catalog page are byte-identical across logged-out users and shareable across cohorts of logged-in ones. The CDN sits a few milliseconds from the user, fields most requests entirely, and only escalates the long tail to origin. The team treats edge latency as the typical case because, for that workload, it is.</p>
<p>The numbers behind this gift are large. In typical setups, TTFB for cached content runs around 37ms versus 136ms for uncached — a roughly 73% penalty on the miss path before you have done any work. And that is just cache-miss-but-origin-is-fast. When origin itself is a fan-out to a model provider hundreds of milliseconds away, the penalty is not a factor of three. It is a factor of ten or more.</p>
<p>The crucial part is that this performance was never something the application team earned. It was a hidden subsidy from how the web was built. Static assets are cacheable because URLs are stable identifiers and bytes are deterministic. JSON catalog responses are cacheable because the relevant inputs are coarse — locale, region, maybe an A/B bucket — and a cache keyed on those fields hits often. The team that built on top of this never had to think about cacheability as a property they were responsible for maintaining. It came in the box.</p>
<p>A personalized AI feature does not come in that box. The prompt prefix contains the user's history, the user's preferences, and the user's most recent action. The output is generated rather than retrieved, and the model is a stateless function whose input distribution is essentially unique per request. Every request misses every cache by construction. The team inherits the SLO and not the physics.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-latency-budget-that-was-a-property-of-a-different-stack">The Latency Budget That Was a Property of a Different Stack<a href="https://tianpan.co/blog/2026-06-02-the-cdn-edge-cache-your-ai-feature-could-not-use#the-latency-budget-that-was-a-property-of-a-different-stack" class="hash-link" aria-label="Direct link to The Latency Budget That Was a Property of a Different Stack" title="Direct link to The Latency Budget That Was a Property of a Different Stack" translate="no">​</a></h2>
<p>The discipline that gets skipped here is restating the latency budget when the cacheability model changes. The conversation that should happen at design time is uncomfortable because it forces a renegotiation with whoever signed off on the customer-facing SLO, but it is the cheapest place to have it. Six weeks later, with a dashboard full of red and a contract obligation in the legal queue, the same conversation costs a quarter.</p>
<p>The mental model that fails here is treating the SLO as a portable target rather than as a property of the workload that produced it. A 200ms TTFB on a cached catalog response and a 200ms TTFB on a per-user generated summary are not the same kind of number. The first is a measurement of how close the CDN edge is to the user. The second would be a measurement of how fast a model can generate the first token of a response that did not exist before the request arrived. The two numbers happen to have the same units, but they belong to different problems.</p>
<p>A useful question to ask before any AI feature launch: what is the floor of the latency this feature can achieve given the caching properties it has, not the caching properties of the surfaces it lives next to? The floor is not the average. The floor is the irreducible minimum: network round-trips you cannot collapse, model TTFT you cannot beat, plus any synchronous setup. If your customer-facing SLO is below the floor, the SLO is broken before you have written code. You can argue the SLO down to the floor before launch, or you can argue with the dashboard for the next two quarters.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="feature-class-taxonomy-latency-budgets-that-match-cacheability">Feature-Class Taxonomy: Latency Budgets That Match Cacheability<a href="https://tianpan.co/blog/2026-06-02-the-cdn-edge-cache-your-ai-feature-could-not-use#feature-class-taxonomy-latency-budgets-that-match-cacheability" class="hash-link" aria-label="Direct link to Feature-Class Taxonomy: Latency Budgets That Match Cacheability" title="Direct link to Feature-Class Taxonomy: Latency Budgets That Match Cacheability" translate="no">​</a></h2>
<p>The first concrete pattern that fixes this is refusing the unified product-wide latency SLO and instead defining a feature-class taxonomy where each class has its own budget calibrated to its cacheability model. A few useful classes:</p>
<ul>
<li class=""><strong>Statically cacheable</strong>: deterministic responses that are identical across users (autocomplete dictionaries, common documentation lookups). p50 TTFB target measured at the edge, no model call required for most hits.</li>
<li class=""><strong>Semantically cacheable</strong>: responses that are not byte-identical but are semantically equivalent across enough users that a vector-similarity cache fires often (FAQ-style answers, common code-review patterns). p50 budget reflects the embedding lookup plus a vector index hit; p99 reflects the cold path through the model.</li>
<li class=""><strong>Per-user generated</strong>: responses whose prompt prefix contains user-specific data that varies on every call. The latency floor is dominated by the model provider's TTFT; the team has no path under it without changing the model or the prefix structure.</li>
</ul>
<p>The honest taxonomy admits that the per-user-generated class will never hit the SLO of the cacheable classes and refuses to inherit their numbers. The customer-facing surface then either accepts the higher latency, or the feature design moves work out of the per-user-generated class and into one of the cacheable ones — usually by separating the personalization layer from the generation layer.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-cache-you-can-still-build-at-the-provider-boundary">What Cache You Can Still Build at the Provider Boundary<a href="https://tianpan.co/blog/2026-06-02-the-cdn-edge-cache-your-ai-feature-could-not-use#what-cache-you-can-still-build-at-the-provider-boundary" class="hash-link" aria-label="Direct link to What Cache You Can Still Build at the Provider Boundary" title="Direct link to What Cache You Can Still Build at the Provider Boundary" translate="no">​</a></h2>
<p>Refusing to inherit the wrong SLO does not mean giving up on caching entirely. It means moving the cache from the CDN edge to the model provider boundary, where the unit of cacheability is different. Two layers are worth building deliberately.</p>
<p><strong>Prompt prefix caching.</strong> Both major providers ship this now. Anthropic's <code>cache_control</code> breakpoints let you mark a stable prefix and pay roughly 10% of the input price on cache reads, against a 25% write premium. OpenAI auto-caches stable prefixes above a threshold and bills cached tokens at 50% of normal input. The shape of the win is the same: if you can hold a large system prompt or a large tool catalog or a large retrieved chunk constant across many requests, you stop paying for it on each call and you cut TTFT meaningfully.</p>
<p>The pattern that breaks prompt prefix caching is exactly the pattern that personalization tends to ship by default: putting the user's name, account ID, or recent history early in the prompt, before any of the stable scaffolding. The cache key is the prefix; if the prefix varies per user, the cache never warms. The fix is mechanical but easy to forget — move all per-request, per-user content after the cached boundary, and keep the system prompt and tool definitions and any cohort-shared context above it. A few minutes of prompt restructuring can be the difference between a 90% cache discount and zero.</p>
<p><strong>Semantic caching.</strong> This is the other layer, and it is fundamentally different from prompt caching. Prompt caching cuts the cost and latency of a call that still happens; semantic caching eliminates the call. You embed the incoming query, look it up against an embedding index of past queries above a similarity threshold, and serve the past response. Reported hit rates in the wild land in the 60-70% range for query-shaped traffic — high enough that the saved cost dominates the embedding-plus-vector-lookup overhead.</p>
<p>Semantic caching has sharper edges than prompt caching. The similarity threshold is a tuning problem with real failure modes: too loose and you serve a wrong answer to a question that looked superficially similar; too tight and the hit rate collapses. And critically, semantic caching for a personalized feature has the same scope problem as edge caching — if the response depends on user history, the cache has to be scoped per-user, which collapses the hit rate because each user's query distribution is small. The win is real for shared question patterns; it does not transfer to genuinely per-user-generated content.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architectural-realization-you-have-to-make-out-loud">The Architectural Realization You Have to Make Out Loud<a href="https://tianpan.co/blog/2026-06-02-the-cdn-edge-cache-your-ai-feature-could-not-use#the-architectural-realization-you-have-to-make-out-loud" class="hash-link" aria-label="Direct link to The Architectural Realization You Have to Make Out Loud" title="Direct link to The Architectural Realization You Have to Make Out Loud" translate="no">​</a></h2>
<p>The underlying realization that gets missed: personalized AI features inherit different physics from the cached web. The CDN economics that quietly powered every prior feature were not a free property of the stack — they were a property of the workload. Static assets and shareable JSON responses earned the cache hit rate; personalized generated content does not.</p>
<p>The teams that ship AI features without naming this end up paying origin-grade latency on every request while their dashboards still display targets calibrated to edge-grade workloads. The dashboards are not lying; the targets are just from the wrong universe.</p>
<p>The team that makes this realization out loud, early, does three things. They publish a feature-class taxonomy and refuse to negotiate a single product-wide latency SLO across classes with different floors. They build the caches that the model provider boundary actually supports — prompt prefix caching for the system prompt, semantic caching for shared query patterns — and they design the prompt structure to make those caches fire. And they tell the customer-facing stakeholders, in writing, that the personalized-generation class has a different latency floor than the rest of the product and that the SLO has to reflect it.</p>
<p>The team that does not make the realization keeps debugging a dashboard that was red the moment the feature shipped. The work to migrate to the right SLO is the same either way. The only difference is whether it gets done before the contracts are signed.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="ai-engineering" term="ai-engineering"/>
        <category label="latency" term="latency"/>
        <category label="caching" term="caching"/>
        <category label="slo" term="slo"/>
        <category label="cdn" term="cdn"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Chain-of-Thought You Stripped to Save Tokens That Hid an Evidence Requirement]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-chain-of-thought-you-stripped-that-hid-an-evidence-requirement</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-chain-of-thought-you-stripped-that-hid-an-evidence-requirement"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Stripping reasoning tokens to cut inference cost looks like a clean optimization until an auditor asks for a rationale you no longer produce. Reasoning traces are dual-use artifacts — engineering cost lines and regulated evidence — and the team that owns the prompt rarely owns the audit.]]></summary>
        <content type="html"><![CDATA[<p>A platform team shipped a prompt refactor that cut average response cost by thirty-two percent. The change was simple: strip the "explain your reasoning" preamble, ask the model to return only the JSON object, and drop the post-processing step that parsed the rationale out of the model's prose. The dashboard turned green. The unit economics page in the quarterly review went from yellow to gold. Nobody on the platform team thought to consult the risk team, because no part of the change touched the answer the customer received.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Chain-of-Thought%20You%20Stripped%20to%20Save%20Tokens%20That%20Hid%20an%20Evidence%20Requirement" alt="" class="img_ev3q"></p>
<p>Two quarters later, a regulated customer's auditor requested the decision rationale for a denied-loan letter from a date six months prior. The team pulled the trace. The input was there. The output was there. The reasoning was gone — not because anyone deleted it, but because it had stopped being produced the day the refactor shipped. The customer's compliance program had been operating on the assumption that the rationale was somewhere in the trace store; the platform team had been operating on the assumption that the rationale was nobody's problem because the customer-facing answer was unchanged. Both assumptions were correct in isolation. Together they cost the customer a regulatory finding and the platform team a contract renewal.</p>
<p>The lesson sounds like a process failure, and it is. But the deeper lesson is structural: reasoning tokens are a dual-use artifact. To the engineering team optimizing per-request cost, they are a line item denominated in dollars per million. To the risk team defending a credit decision, they are the only place the model's "why" lives. When the same byte stream serves two audiences with non-overlapping retention horizons and quality bars, you cannot optimize one without consulting the other — and most platform teams do not even know there is another.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="reasoning-tokens-live-in-two-cost-models-at-once">Reasoning Tokens Live in Two Cost Models at Once<a href="https://tianpan.co/blog/2026-06-02-the-chain-of-thought-you-stripped-that-hid-an-evidence-requirement#reasoning-tokens-live-in-two-cost-models-at-once" class="hash-link" aria-label="Direct link to Reasoning Tokens Live in Two Cost Models at Once" title="Direct link to Reasoning Tokens Live in Two Cost Models at Once" translate="no">​</a></h2>
<p>A reasoning token costs the same as an output token on the wire — sometimes more, when the provider prices "thinking" tokens at a premium. A model that emits eight hundred reasoning tokens to justify a fifty-token answer has just paid for an eighteen-times multiplier on the visible output. From the unit-economics dashboard, that ratio looks like waste. From the compliance dashboard, that ratio is the entire product.</p>
<p>The two dashboards live on the same byte stream and assign it incompatible values. Engineering wants the ratio at one-to-one. Compliance wants the ratio at whatever-it-takes-to-survive-an-audit. The team that owns the prompt is almost always engineering. The team that depends on the output is almost always not in the room when the prompt changes.</p>
<p>This asymmetry is the source of the bug. If reasoning tokens were billed to a budget line owned by the risk team, no platform engineer would touch them without a conversation. They are not. They are billed to the inference budget, owned by infrastructure, optimized against a target that has nothing to do with auditability. The prompt that asks the model to "answer only with the JSON" is a perfectly rational local optimization that produces a globally indefensible artifact.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-failure-modes-that-look-like-optimization">The Failure Modes That Look Like Optimization<a href="https://tianpan.co/blog/2026-06-02-the-chain-of-thought-you-stripped-that-hid-an-evidence-requirement#the-failure-modes-that-look-like-optimization" class="hash-link" aria-label="Direct link to The Failure Modes That Look Like Optimization" title="Direct link to The Failure Modes That Look Like Optimization" translate="no">​</a></h2>
<p>Three patterns reliably destroy evidence chains without anyone noticing until the auditor arrives.</p>
<p><strong>The clean-output refactor.</strong> The prompt is rewritten to drop the "explain your reasoning, then answer" preamble. The model now emits the JSON directly. Cost drops. Latency drops. The reasoning trace is not deleted — it never existed in the first place. The risk team's evidence pipeline was downstream of a string the model used to produce and now doesn't, and the pipeline silently returns empty rationales for every decision shipped after the change. Nobody notices because the rationale is sampled in audits, not in production traffic.</p>
<p><strong>The trace-store retention mismatch.</strong> The reasoning is kept in the response, but the team that owns "observability" routes everything in the trace store to the same retention class. Operational traces age out at thirty days because that is the cost-efficient default for debugging. Compliance evidence ages out at thirty days because nobody told the observability team it was anything else. The audit window for a fair-lending review is seven years. The first time anyone discovers the mismatch is when a regulator asks for a rationale from month seven and the storage layer returns a four-hundred-and-four.</p>
<p><strong>The syntactically-present rationale.</strong> Someone reads a CFPB bulletin and adds "Briefly state the primary reason for the decision" to the prompt. The model produces a sentence. The sentence is technically a rationale. It says "applicant credit profile" or "insufficient documentation" or some other phrase that satisfies a checkbox and tells a denied applicant nothing actionable. The CFPB has explicitly stated that creditors cannot rely on the sample-form checklist of reasons if those reasons do not specifically and accurately indicate the principal reasons for the adverse action. A one-line "credit profile" rationale is exactly the kind of vague placeholder the bureau is calling out. The model is producing tokens; they are not producing evidence.</p>
<p>In each case, the artifact looks superficially correct on the dimension the engineering team measured. Each fails on the dimension the engineering team did not know was being measured.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-regulator-does-not-care-how-you-priced-it">The Regulator Does Not Care How You Priced It<a href="https://tianpan.co/blog/2026-06-02-the-chain-of-thought-you-stripped-that-hid-an-evidence-requirement#the-regulator-does-not-care-how-you-priced-it" class="hash-link" aria-label="Direct link to The Regulator Does Not Care How You Priced It" title="Direct link to The Regulator Does Not Care How You Priced It" translate="no">​</a></h2>
<p>The CFPB's position on ECOA and Regulation B is uncommonly direct for a regulator: a creditor cannot justify noncompliance by arguing that the technology making the decision is too complex or too opaque to identify specific reasons for adverse action. If the model is too complex to produce defensible rationales, the model cannot be used. There is no exception for "we removed the reasoning to save costs." There is no exception for "the rationale was in the trace logs but they aged out." Explainability is a precondition of deployment, not a feature you can amortize against inference budget.</p>
<p>The European posture is the same in different language. Article 12 of the EU AI Act requires high-risk systems to automatically log events sufficient to ensure traceability throughout the system's lifecycle, with logs retained appropriately and protected against tampering. Article 18 obliges providers to retain those logs against the audit horizon, which for high-risk systems extends years beyond the operational window any engineering team would choose on cost grounds. The August 2026 compliance deadline for core high-risk requirements has already passed for some categories, and the FCA's 2026 examination posture explicitly emphasizes "principles with proof" — the regulator wants to see the trace, not a description of the trace.</p>
<p>What this means in practice: a model decision that affects a consumer, a patient, a borrower, or a tenant must produce an artifact that, years later, lets a regulator reconstruct why. If the artifact is gone because the prompt no longer asks for it, the regulator does not accept "we were optimizing." If the artifact is gone because retention was scoped to operational needs, the regulator does not accept "that's what the observability stack defaults to." The defense the platform team would like to mount — that they were maximizing the efficiency of a system they were authorized to maximize — is exactly the defense the rules were written to refuse.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-per-decision-class-policy-beats-a-per-prompt-reflex">A Per-Decision-Class Policy Beats a Per-Prompt Reflex<a href="https://tianpan.co/blog/2026-06-02-the-chain-of-thought-you-stripped-that-hid-an-evidence-requirement#a-per-decision-class-policy-beats-a-per-prompt-reflex" class="hash-link" aria-label="Direct link to A Per-Decision-Class Policy Beats a Per-Prompt Reflex" title="Direct link to A Per-Decision-Class Policy Beats a Per-Prompt Reflex" translate="no">​</a></h2>
<p>The discipline that closes the gap is not "always include reasoning." It is a per-decision-class policy that names, before any prompt is written, whether the reasoning trace is a product surface, an audit surface, both, or neither. The four answers have four different consequences.</p>
<p>When reasoning is a product surface — a coding assistant explaining its diff, a search engine showing its citations — the trace is part of the UX and retention can follow the product's logs. When reasoning is an audit surface only — a credit-decision rationale, a medical-triage justification — the trace is the only artifact the regulator will accept and retention has to match the audit window, which is years and not days. When reasoning is both, the system needs two paths: a redacted, user-friendly version for the product and a full version for the audit pipeline, with the second never throttled by the cost pressure that periodically reshapes the first. When reasoning is genuinely neither — an internal classification with no consumer impact and no regulated dimension — strip it freely.</p>
<p>The policy belongs in the same document that names the model, the prompt, and the cost target. If it lives anywhere else, the next refactor will route around it.</p>
<p>Three operational pieces make the policy survive in production. First, the reasoning trace pipeline is separate from the operational trace pipeline. They have different retention classes, different access controls, and different ownership. The risk team owns the reasoning pipeline; nobody else can change its retention without a formal review. Second, an evidence-quality eval runs against the rationales the model actually emits, grading them not on answer correctness but on whether a human auditor would accept the rationale as specifically and accurately describing the principal reason. This eval catches the syntactically-present-substantively-useless failure mode that no engineering metric will. Third, the cost model for the inference budget prices reasoning tokens against the audit value they produce, not against the visible answer. If a reasoning trace prevents a five-million-dollar regulatory finding, paying a few cents per request to preserve it is a trivially obvious trade.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architectural-realization">The Architectural Realization<a href="https://tianpan.co/blog/2026-06-02-the-chain-of-thought-you-stripped-that-hid-an-evidence-requirement#the-architectural-realization" class="hash-link" aria-label="Direct link to The Architectural Realization" title="Direct link to The Architectural Realization" translate="no">​</a></h2>
<p>Reasoning tokens are the only place the model's "why" lives. They are also, on most systems, the easiest line item to cut, because they are visible in the cost dashboard and invisible in the user-facing output. The two facts together describe a near-perfect failure mode: a thing whose value is held by a team that does not see the cost, optimized by a team that does not see the value.</p>
<p>A team that deletes reasoning to save thirty percent on inference has also deleted the answer to every question an auditor, a customer, or a postmortem is going to ask. The thirty percent shows up in this quarter's financials. The deletion shows up in the next adverse-action complaint, the next compliance review, the next incident where someone needs to know what the model was thinking and the answer is gone. The financial savings are immediate and the cost is deferred, which is exactly the structure of every decision that looks brilliant in the moment and ruinous in the postmortem.</p>
<p>The right framing is not that reasoning tokens are expensive. It is that reasoning tokens are evidence, evidence is a regulated artifact in any high-stakes domain, and the team optimizing the prompt does not get to unilaterally decide what evidence the company is required to produce. Until that framing makes it into the prompt-review checklist, every cost-savings refactor in a regulated product is a compliance time bomb with a fuse the length of one audit cycle.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="llm-ops" term="llm-ops"/>
        <category label="compliance" term="compliance"/>
        <category label="observability" term="observability"/>
        <category label="ai-governance" term="ai-governance"/>
        <category label="cost-optimization" term="cost-optimization"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Coding Agent CI Bill That Doubled Without a Postmortem]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-coding-agent-ci-bill-that-doubled-without-a-postmortem</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-coding-agent-ci-bill-that-doubled-without-a-postmortem"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A coding agent does not push when work is ready — it pushes to find out if work is ready. CI cost stops scaling with commits and starts scaling with plan steps, and the forecast model finance built last year no longer holds.]]></summary>
        <content type="html"><![CDATA[<p>The line item climbed 130% over six weeks and nobody on the engineering team noticed. PRs were landing faster. Per-PR CI cost on the dashboard looked the same as last quarter. The agent's branches went green on the first try more often than the humans' branches did, which actually pulled the median CI duration <em>down</em>. Finance found it during quarterly review, flagged it as an unexplained variance, and asked engineering for the postmortem. Engineering had nothing to write — no incident, no regression, no failed deploy. Just a budget line that had quietly doubled while every dashboard reported normal.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Coding%20Agent%20CI%20Bill%20That%20Doubled%20Without%20a%20Postmortem" alt="" class="img_ev3q"></p>
<p>That postmortem-shaped hole is the artifact. The cost shifted from a labor-dominant curve to an infrastructure-dominant curve, and the team that owned the labor budget was not the team that owned the infrastructure budget. The agent didn't break anything. It just changed which line on the P&amp;L absorbed the work.</p>
<p>The numbers at the platform level tell the same story at a different scale. GitHub reported PRs opened by AI agents rising from roughly 4 million in September 2025 to more than 17 million in March 2026, and weekly Actions compute minutes climbing from 500 million in 2023 to 2.1 billion in a single week of 2026. The June 1, 2026 shift of Copilot to usage-based billing — and the move to charge Copilot code review against Actions minutes at the same per-minute rate as any other workflow — is the provider repricing the same curve your finance team is staring at. The bill is moving because the workflow moved. The dashboard is the last place it shows up because the dashboard was designed around the workflow that no longer dominates.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="ci-was-an-ocommits-cost-the-agent-made-it-oplan-steps">CI Was an O(commits) Cost; The Agent Made It O(plan steps)<a href="https://tianpan.co/blog/2026-06-02-the-coding-agent-ci-bill-that-doubled-without-a-postmortem#ci-was-an-ocommits-cost-the-agent-made-it-oplan-steps" class="hash-link" aria-label="Direct link to CI Was an O(commits) Cost; The Agent Made It O(plan steps)" title="Direct link to CI Was an O(commits) Cost; The Agent Made It O(plan steps)" translate="no">​</a></h2>
<p>A human engineer pushes a commit when they think the change is ready, or close to ready. Maybe two or three CI runs per PR — one when they first push, one after they address review comments, one after a rebase. The cost of CI scales with the number of commits the human authored, and the number of commits is bounded by how often a human types <code>git push</code>.</p>
<p>A coding agent doesn't push when something is ready. It pushes when it wants to <em>find out</em> if something is ready. Tests are not the gate at the end of the work; tests are the feedback signal <em>during</em> the work. Each iteration of the agent's plan triggers a CI run because that run is the cheapest way for the agent to verify whether its last edit moved toward the goal. Ten plan steps in one PR is not unusual. Twenty is not unusual. Cost per PR didn't change because per-PR was always the wrong denominator. The denominator that moved was <em>runs per outcome</em>, and the agent multiplied it.</p>
<p>This is the part the engineering dashboard hides. Dashboards almost always normalize CI cost by "per PR" or "per commit" or "per merged change." Each of those denominators implicitly assumes that the unit of authorship is human-paced. When the author is an agent, every one of those denominators inflates in lockstep with the numerator, and the ratio stays flat. The bill goes up. The ratio is unchanged. The dashboard is technically correct and operationally useless.</p>
<p>A useful denominator measures the cost of CI against something the agent can't inflate: shipped features, customer-visible fixes, or external requirements like compliance attestations. Once you switch to that denominator, the curve becomes visible immediately, and the conversation with finance shifts from "why did infra spend grow" to "what was the per-feature cost of agent-authored work, and is that the rate we want to pay."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-attribution-gap-is-the-failure-mode">The Attribution Gap Is the Failure Mode<a href="https://tianpan.co/blog/2026-06-02-the-coding-agent-ci-bill-that-doubled-without-a-postmortem#the-attribution-gap-is-the-failure-mode" class="hash-link" aria-label="Direct link to The Attribution Gap Is the Failure Mode" title="Direct link to The Attribution Gap Is the Failure Mode" translate="no">​</a></h2>
<p>If your CI logs do not distinguish agent-authored commits from human-authored commits, you cannot do the analysis your finance team is asking for. You cannot answer "is the growth coming from the agent" because the data does not contain the column.</p>
<p>The fix is straightforward at the metadata level and tedious at the implementation level. Every job that runs in CI should be tagged with the authorship class of the commit that triggered it. That tag has to be applied at the moment the job is enqueued — retroactive attribution from <code>git log</code> will always under-count because some commits get squashed, some agents commit-as-the-human, and some plan steps run on ephemeral branches that never land. Capture it at trigger time, store it on the job, and ship it to the same dashboard that tracks runner minutes.</p>
<p>The same attribution problem shows up at the LLM-cost layer. Practitioners who have wired up production agent cost monitoring tend to converge on a single rule: tags get attached at request creation, never reconstructed from logs. Anthropic's usage API now lets you tag each call with project, team, and task identifiers; the equivalent move for CI is to tag each job at enqueue with <code>actor=agent</code> or <code>actor=human</code> and propagate that tag through every downstream metric. Without the tag, you can audit cost. With the tag, you can govern it.</p>
<p>GitHub's June 2026 introduction of cost centers and per-user budgets exists for exactly this reason. The platform is offering you a column. The work is wiring your CI configuration to populate it correctly — and noticing when an agent runs as the human's identity, which silently mis-classifies the row.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="a-per-author-budget-is-not-punishment-it-is-a-signal-channel">A Per-Author Budget Is Not Punishment; It Is a Signal Channel<a href="https://tianpan.co/blog/2026-06-02-the-coding-agent-ci-bill-that-doubled-without-a-postmortem#a-per-author-budget-is-not-punishment-it-is-a-signal-channel" class="hash-link" aria-label="Direct link to A Per-Author Budget Is Not Punishment; It Is a Signal Channel" title="Direct link to A Per-Author Budget Is Not Punishment; It Is a Signal Channel" translate="no">​</a></h2>
<p>The instinct, when finance flags a variance, is to cap something. Cap the agent's runs, cap the per-PR minutes, cap the model the agent is allowed to use. The cap stops the bleeding, but it also stops the work, and it does not tell the team what to change.</p>
<p>A per-author CI budget has a different purpose. It is a signal channel. It tells the agent — or the human supervising the agent — that the inner loop has become expensive, and it does that early enough to change the loop rather than retroactively after a quarter-end review. Three structural patterns produce the signal without breaking the workflow.</p>
<p>The first is a tiered CI configuration where the agent's inner-loop runs use a fast, cheap test subset, and the full suite is reserved for the moment the PR is marked ready for human review. This mirrors the way fast monorepo build systems like Bazel — and dynamic pipelines on Buildkite — let you compute the affected target set from the diff and run only the tests that intersect it. The agent gets fast feedback. The full suite still runs before merge. The cost of "the agent iterates twenty times" goes down by an order of magnitude because nineteen of those iterations don't run the slow integration tier.</p>
<p>The second is a cost signal exposed back to the agent itself. If the agent can read the cost of its last CI run as part of its observation, it can choose cheaper verification strategies on subsequent steps — running a subset of tests, deferring the slow tier, deciding to read source instead of run a probe. Most teams skip this because plumbing cost back to the agent feels like over-engineering. It is the single highest-leverage piece of plumbing once the agent's run rate exceeds the human team's.</p>
<p>The third is a hard cap that fires not at the per-run level but at the per-task level. A budget that says "this PR has used 40 minutes of Actions time across iterations; the next push from this branch requires a human sign-off" gives the human a place to intervene without preemptively forbidding iteration. The cap is not a refusal. It is a checkpoint, and checkpoints are what let you trust an autonomous loop with a real budget.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-forecasting-model-was-the-hidden-assumption">The Forecasting Model Was the Hidden Assumption<a href="https://tianpan.co/blog/2026-06-02-the-coding-agent-ci-bill-that-doubled-without-a-postmortem#the-forecasting-model-was-the-hidden-assumption" class="hash-link" aria-label="Direct link to The Forecasting Model Was the Hidden Assumption" title="Direct link to The Forecasting Model Was the Hidden Assumption" translate="no">​</a></h2>
<p>The thing that broke is not the CI bill. The CI bill is doing exactly what you would predict given the new workload. The thing that broke is the forecasting model the FP&amp;A team is running against. That model was built when CI cost grew linearly with headcount and shipped feature volume, because a human can only push so many commits per day and only opens a PR when the work is roughly done. The constants in that model — minutes per engineer per week, runs per PR, retries per failed deploy — were stable enough that quarterly variance was a noise term.</p>
<p>Once an agent is authoring commits, those constants are no longer stable. Minutes per engineer per week becomes minutes per <em>engineer-supervised agent loop</em> per week, and the multiplier on that depends on how many concurrent agents the engineer can supervise, which is itself a function of how good your review tooling and your agent's planning loop have gotten this quarter. The forecasting model has a new independent variable, and it is one finance was not told about because the rollout looked like a productivity tool, not a cost-curve change.</p>
<p>The conversation engineering should be having with finance is not "we found the variance and capped it." It is "the cost model assumed labor was the binding constraint and that is no longer true; here is the new model, here is the new run-rate assumption, and here is the per-feature cost we are now willing to pay because the throughput moved." Without that conversation, finance is forecasting against a labor-dominant cost model that doesn't exist anymore, and engineering is treating each quarterly variance as a separate surprise. The variance is not a surprise. It is the new normal expressing itself through a model that hasn't been updated.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="treat-the-coding-agent-as-a-cost-curve-shift-not-a-productivity-tool">Treat the Coding Agent as a Cost-Curve Shift, Not a Productivity Tool<a href="https://tianpan.co/blog/2026-06-02-the-coding-agent-ci-bill-that-doubled-without-a-postmortem#treat-the-coding-agent-as-a-cost-curve-shift-not-a-productivity-tool" class="hash-link" aria-label="Direct link to Treat the Coding Agent as a Cost-Curve Shift, Not a Productivity Tool" title="Direct link to Treat the Coding Agent as a Cost-Curve Shift, Not a Productivity Tool" translate="no">​</a></h2>
<p>The discipline the team that owned this incident wishes they had practiced earlier is small and uncomfortable. Before rolling out the coding agent broadly, write down which line items you expect it to move and by how much. CI minutes is the obvious one. LLM token spend is the obvious one. Less obvious: artifact storage if the agent's iterations produce more build artifacts, secret-scanning and dependency-review costs because they run on every push, code-review tool costs that meter by event volume, and observability ingestion costs because the agent's traces are not free.</p>
<p>Then wire the attribution before the rollout. Tag jobs at enqueue. Add cost centers. Stand up the per-author dashboard before the first agent-authored PR lands, not after the second quarterly variance review. Decide on the per-feature denominator that finance and engineering will both agree to track against. Pre-commit, in writing, to the run-rate you are buying — so that when the rate is reached, the conversation is about renegotiating the rate, not explaining why the variance happened.</p>
<p>The coding agent is not a tool the team adopts. It is a workflow that shifts which budget pays for which work, and the org that doesn't notice the shift is going to keep finding the cost in places it forgot to instrument. The postmortem you want to write is the one that happens before the variance — the one that says: the labor curve is flattening, the infrastructure curve is bending up, and the team that pays for each has agreed on what that trade is worth.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="ai-engineering" term="ai-engineering"/>
        <category label="coding-agents" term="coding-agents"/>
        <category label="ci-cost" term="ci-cost"/>
        <category label="finops" term="finops"/>
        <category label="devops" term="devops"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Compliance Audit That Asked Which Model Produced Which Output]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-compliance-audit-that-asked-which-model-produced-which-output</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-compliance-audit-that-asked-which-model-produced-which-output"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[An endpoint alias is not an artifact. When the auditor asks which checkpoint produced a decision, only per-decision checkpoint pinning gives a defensible answer.]]></summary>
        <content type="html"><![CDATA[<p>The auditor's question sounds simple. She has your appeals log open, points at a row from eight months ago, and asks which model decided that case. Your engineer pulls up the schema: there is a <code>model</code> column, and every decision in the audit window says <code>v1</code>. Then someone from the platform team mentions, almost in passing, that the alias behind <code>v1</code> rotated four times during the audit period — a base model upgrade, a fine-tune refresh, a vendor-side capacity move, and one rollback that lasted six hours during an incident. The honest answer is that you cannot say which checkpoint produced that decision. The auditor writes something down. That phrase is not a regulator-acceptable answer, and you have just learned that the system you shipped has been failing an audit requirement it was never designed to meet.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Compliance%20Audit%20That%20Asked%20Which%20Model%20Produced%20Which%20Output" alt="" class="img_ev3q"></p>
<p>The gap here is not a missing log line. The gap is between two different ideas of what "model" means. To the engineers shipping the system, <code>v1</code> is an endpoint — a stable contract callers can point at while the thing behind it gets upgraded for free. To the auditor, "the model that produced this decision" is a specific artifact: a weight checkpoint, a hash, a thing you could in principle re-run on the same input and get a defensibly similar output. Endpoint aliases were invented to hide checkpoint rotation from callers. Audit-grade provenance demands the opposite — that every decision be attributable to exactly the checkpoint that produced it. The two ideas were on a collision course from the start; the audit just happened to be where they met.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-endpoint-is-not-the-model">The Endpoint Is Not the Model<a href="https://tianpan.co/blog/2026-06-02-the-compliance-audit-that-asked-which-model-produced-which-output#the-endpoint-is-not-the-model" class="hash-link" aria-label="Direct link to The Endpoint Is Not the Model" title="Direct link to The Endpoint Is Not the Model" translate="no">​</a></h2>
<p>The convenience of an alias is real. You ship code that calls <code>model: "v1"</code> or <code>model: "claude-sonnet-latest"</code> or your internal <code>risk-scoring-prod</code>, and you do not have to deploy every time the model behind it changes. Provider-side, the same convenience is even more valuable: vendors rotate model versions, retire old snapshots, and redirect capacity without forcing every customer to cut a release. OpenAI's aliased endpoints behave this way, and Anthropic has been asked for similar <code>-latest</code> aliases for the same reason. The pattern is industry-standard; it would be unusual to find a production AI system that does not use it somewhere.</p>
<p>The problem is that "model" is a polysemous word, and the alias quietly chooses the wrong meaning for compliance purposes. When the data team builds a dashboard and stores <code>model = "v1"</code> next to each decision, they have stored the endpoint name, not the artifact. The endpoint name is approximately useless as an audit primitive, because the function the endpoint computed is not constant across the audit window. You did not run "v1" on the customer's case in February and the same "v1" on a near-identical case in May — you ran two different checkpoints reachable through the same string. Storing the endpoint name in the audit log is roughly equivalent to storing "the production database" instead of the specific row.</p>
<p>This is the silent-versioning problem in its most expensive form. The aliased endpoint that was supposed to free engineering from a coordination tax turned out to be a hidden coordination tax of a different shape — a tax the compliance team has to pay, in arrears, with the auditor watching.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-audit-grade-provenance-actually-requires">What Audit-Grade Provenance Actually Requires<a href="https://tianpan.co/blog/2026-06-02-the-compliance-audit-that-asked-which-model-produced-which-output#what-audit-grade-provenance-actually-requires" class="hash-link" aria-label="Direct link to What Audit-Grade Provenance Actually Requires" title="Direct link to What Audit-Grade Provenance Actually Requires" translate="no">​</a></h2>
<p>The regulatory frame is converging fast on a specific shape. The EU AI Act's Article 10 requires version-control records and provenance information that enable traceability between datasets and model versions; Article 12 requires automatic logging of events that allow full traceability of inputs, outputs, and decision points. The Federal Reserve's revised model risk management guidance — what used to live under SR 11-7 and was updated in April 2026 — keeps the same essential demand: a model is an artifact, not an alias, and the institution must be able to point at the specific artifact that produced any given decision. Adverse-action regimes under ECOA and FCRA make this concrete for consumer credit, where the principal-reasons obligation cannot be answered honestly if you do not know which model generated the score.</p>
<p>Translated out of regulatory language, the demand is: for every decision your AI system makes that has a downstream effect on a person — a credit denial, a claims rejection, a content takedown, a benefits determination — you should be able to produce, on demand, the exact checkpoint identifier and a record that lets a third party reason about that checkpoint's behavior. "We use model v1" does not clear that bar. "This decision was produced by checkpoint <code>sha256:7b3f...</code>, which is registered in our model registry with this card and this evaluation profile" does.</p>
<p>A useful test: imagine the auditor asks you to re-run the case through the same model that decided it. Can you? If the honest answer involves "we'd have to ask the vendor whether they still have that checkpoint hosted," your provenance is below audit grade. If the honest answer is "we cannot, because the production endpoint has since rotated," your provenance is below audit grade. The bar is reproducibility-with-respect-to-the-checkpoint, not reproducibility-with-respect-to-the-endpoint-name.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="how-the-drift-happens-without-anyone-noticing">How the Drift Happens Without Anyone Noticing<a href="https://tianpan.co/blog/2026-06-02-the-compliance-audit-that-asked-which-model-produced-which-output#how-the-drift-happens-without-anyone-noticing" class="hash-link" aria-label="Direct link to How the Drift Happens Without Anyone Noticing" title="Direct link to How the Drift Happens Without Anyone Noticing" translate="no">​</a></h2>
<p>The frustrating part is that no single change feels like a compliance violation. A platform engineer upgrades the underlying model behind <code>v1</code> because the provider deprecated the older snapshot — they have to. A vendor rotates capacity across model versions to balance load — they always have. An MLOps team swaps in a freshly fine-tuned variant behind the same internal endpoint because they want to ship without coordinating with every caller — that is what the endpoint abstraction is for. A six-hour rollback during an incident is reverted before most of the company even sees the page.</p>
<p>Each of these is reasonable on its own. The stack of them is what produces the answer "v1 means four different checkpoints in the same audit window." And because each rotation is invisible at the call site — the request still says <code>model: "v1"</code> and gets back a response shaped like the previous responses — there is nothing in the application code, the request log, or the decision log that would flag the drift. The audit log records the endpoint. The endpoint records nothing.</p>
<p>A secondary failure mode is worth naming: even when teams know they should pin, they often pin in the wrong place. Pinning happens in a config file or environment variable that controls the next request, not in the audit log that records the last decision. If the config rotates between request time and audit time, the audit log inherits whatever the config says today, not what it said when the decision was made. Provenance has to be captured at decision time and stored with the decision, not derived later from a config that has moved on.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="pinning-the-checkpoint-at-decision-time">Pinning the Checkpoint at Decision Time<a href="https://tianpan.co/blog/2026-06-02-the-compliance-audit-that-asked-which-model-produced-which-output#pinning-the-checkpoint-at-decision-time" class="hash-link" aria-label="Direct link to Pinning the Checkpoint at Decision Time" title="Direct link to Pinning the Checkpoint at Decision Time" translate="no">​</a></h2>
<p>The architectural fix is unglamorous and worth doing anyway. At the moment a decision is produced, the system captures the checkpoint identifier the provider actually served, and persists that identifier next to the decision, in the same write, in the same transaction. Not the endpoint name. Not the alias. The specific checkpoint.</p>
<p>For self-hosted models this is straightforward: you control the inference server, you know which checkpoint is loaded, and you can attach a hash of the weights to every response. For hosted APIs it requires more discipline. Most providers return some form of version identifier on the response — OpenAI returns a system fingerprint and a specific model name, Anthropic returns the resolved model in the response, and the major inference gateways expose similar fields. The discipline is to read that field, not the field the caller asked for, and to log the one the provider returned.</p>
<p>A few practices follow from this directly:</p>
<ul>
<li class=""><strong>Treat the request <code>model</code> field and the response <code>model</code> field as different columns.</strong> The request field records what you asked for. The response field records what you got. The audit query reads the response field. If your schema has only one column, your audit answers the wrong question.</li>
<li class=""><strong>Hash what you can, refer to what you cannot.</strong> For self-hosted weights, store the weight hash. For hosted APIs, store whatever stable identifier the provider exposes plus a snapshot of the model card or the provider's published behavior notes from that day. The goal is that a future auditor can, in principle, reason about the artifact behind the identifier — not that you have the weights yourself.</li>
<li class=""><strong>Pin aliases at the gateway, not in application code.</strong> If your platform offers <code>v1</code> as an internal alias, the resolution from <code>v1</code> to a concrete checkpoint should happen in a controlled gateway that logs the resolution, not opportunistically in calling services. One resolution point gives you one place to audit.</li>
<li class=""><strong>Make rotation an event with a record.</strong> When the checkpoint behind an alias changes, that change should produce a durable record — who rotated it, when, from what to what, with what evaluation evidence. The audit story for any decision then has two layers: the per-decision checkpoint identifier, and the rotation history that lets you explain why that checkpoint was in use that day.</li>
</ul>
<p>The decision-log write becomes slightly heavier and the schema gains a few columns. That is the entire cost. The cost of not doing it is the conversation the auditor opened with.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-organizational-move-behind-the-architectural-one">The Organizational Move Behind the Architectural One<a href="https://tianpan.co/blog/2026-06-02-the-compliance-audit-that-asked-which-model-produced-which-output#the-organizational-move-behind-the-architectural-one" class="hash-link" aria-label="Direct link to The Organizational Move Behind the Architectural One" title="Direct link to The Organizational Move Behind the Architectural One" translate="no">​</a></h2>
<p>The deeper change is treating the model registry as a system of record for compliance, not a tool for the ML team. A registry that records every checkpoint that has ever been promoted to production — with its hash, its evaluation profile, its training data lineage, its date range of production use, and the rotation events that moved traffic onto and off of it — is the artifact that lets the company answer the auditor's question without flinching. The registry is the index; the per-decision log is the pointer; together they form a queryable history of "which artifact decided what, and on what evidence was that artifact trusted in production at that moment."</p>
<p>This is also where the work belongs organizationally. Decision logging is an application concern, but checkpoint identity is a platform concern. The platform team owns the gateway that resolves aliases. The platform team owns the registry that records checkpoints. The application team owns the discipline of writing the resolved identifier into the decision log. The compliance team owns the requirement that those identifiers be queryable. None of those parties can do the job alone, and the audit is what reveals which seam was left open.</p>
<p>The seam most teams leave open is the simplest one: nobody decided that the model field in the decision log meant the checkpoint rather than the endpoint. Without that decision, the field defaulted to whatever was easiest, which was the endpoint, because the endpoint is what the calling code knew. The audit-grade fix is small. The audit-grade discipline is choosing to capture artifact identity at every layer where "model" is recorded, and refusing to let an alias stand in for an artifact when a regulator is going to ask. The auditor will eventually ask. You want the answer to already be in the schema.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="ai-compliance" term="ai-compliance"/>
        <category label="model-provenance" term="model-provenance"/>
        <category label="mlops" term="mlops"/>
        <category label="audit" term="audit"/>
        <category label="governance" term="governance"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Conversation Memory Pruning Heuristic That Erased the Context the Next Question Needed]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-conversation-memory-pruning-heuristic-that-erased-the-context-the-next-question-needed</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-conversation-memory-pruning-heuristic-that-erased-the-context-the-next-question-needed"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Recency-and-length pruning evicts the constraint a later turn silently depends on, and the user reads a confident wrong answer as a competence regression. Pruning is the dual of retrieval, and the team that tuned it for token count is silently regressing answer quality.]]></summary>
        <content type="html"><![CDATA[<p>A user opens your long-session agent and says, in turn 3, "I'm vegetarian and on a tight budget." The conversation continues. Eleven turns later, the pruner runs. It counts tokens, finds turn 3 old and short, and drops it to keep the window inside budget. Turn 14 asks, "what should I cook tonight?" The model, looking at a window where the constraint no longer exists, recommends a $40 ribeye. The user reads this as the agent getting worse, opens the satisfaction survey, and rates the session a 2.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Conversation%20Memory%20Pruning%20Heuristic%20That%20Erased%20the%20Context%20the%20Next%20Question%20Needed" alt="" class="img_ev3q"></p>
<p>Nothing in your stack will report a memory failure. The token-budget dashboard will show the window staying healthily under the cap. The latency dashboard will be green. The eval suite — which scores single-turn answers against a held-out set — will report no regression. The only signal that the agent's competence dropped is a thumbs-down rating that your product team will attribute to "model variance." It will not be model variance. It will be a pruning heuristic doing exactly what it was tuned to do, on the wrong objective.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="pruning-is-the-dual-of-retrieval-and-most-teams-built-only-one-side">Pruning Is the Dual of Retrieval, and Most Teams Built Only One Side<a href="https://tianpan.co/blog/2026-06-02-the-conversation-memory-pruning-heuristic-that-erased-the-context-the-next-question-needed#pruning-is-the-dual-of-retrieval-and-most-teams-built-only-one-side" class="hash-link" aria-label="Direct link to Pruning Is the Dual of Retrieval, and Most Teams Built Only One Side" title="Direct link to Pruning Is the Dual of Retrieval, and Most Teams Built Only One Side" translate="no">​</a></h2>
<p>Retrieval has a clear vocabulary. You have a query, a corpus, a similarity function, and a top-k. You evaluate with recall@k against a labeled set. You tune the embedding, the chunking, and the reranker. You measure how often the right passage made it into the window.</p>
<p>Pruning has none of that. The vocabulary that most teams use is "summarize the older half" or "drop turns older than N." There is no query, no labeled set, no recall metric. The choice of what to keep is made by a heuristic that does not know what the next question will be. That asymmetry is the bug.</p>
<p>Read the two operations side by side and the symmetry is obvious. Retrieval asks: of all the things outside the window, which ones does the next turn need? Pruning asks: of all the things inside the window, which ones can I drop without breaking the next turn? Those are the same question, phrased from opposite directions. A retrieval system that selected passages by "shortest and most recent" would be laughed out of the design review. A pruner that selects evictions by "shortest and most recent" gets shipped because nobody calls it retrieval.</p>
<p>The implication is concrete. If your pruner does not know what the user might ask next, it is gambling. The gamble pays off most of the time — recency is, in fact, a decent prior for relevance — and that is precisely what makes the failures so hard to debug. The pruner is right ninety-something percent of the time and catastrophically wrong on the cases where the user is testing whether the agent remembers them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-pruner-optimized-for-token-count-not-answer-correctness">The Pruner Optimized for Token Count, Not Answer Correctness<a href="https://tianpan.co/blog/2026-06-02-the-conversation-memory-pruning-heuristic-that-erased-the-context-the-next-question-needed#the-pruner-optimized-for-token-count-not-answer-correctness" class="hash-link" aria-label="Direct link to The Pruner Optimized for Token Count, Not Answer Correctness" title="Direct link to The Pruner Optimized for Token Count, Not Answer Correctness" translate="no">​</a></h2>
<p>Walk through what a token-count-driven pruner actually sees. It has a budget — say, 8,000 tokens for conversation history. It has a list of turns with timestamps and lengths. It has a rule: keep the most recent N tokens, drop the rest, optionally summarize the dropped span into a paragraph.</p>
<p>What it does not have: a model of which entities, constraints, or commitments the user introduced. It does not know that "vegetarian" is a hard constraint that should outlive its recency window. It does not know that "the project deadline is Friday" is a commitment that the agent will be held to. It does not know that "I already tried that" is a negation that prevents the agent from re-recommending the same solution. From the pruner's perspective, all tokens are equal, and recent tokens are slightly more equal than old ones.</p>
<p>The teams that tune this layer tune it for cost. They run the pruner, they observe the average tokens-per-turn drop from 12,000 to 7,500, they congratulate themselves on a 37% reduction in input cost, and they ship. The cost dashboard turns into a graph going down and to the right. The quality regression — which only shows up in turns that ask questions implicitly anchored to evicted context — never makes it onto a dashboard, because no single-turn eval can catch it.</p>
<p>This is the most insidious property of the failure mode. The cost win is measurable, fast, and visible. The quality loss is silent, slow, and shows up only in a class of multi-turn interactions that the eval suite was not designed to test. A change that looks like a pure cost optimization is a quiet quality regression, and the team is unblinded.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-single-turn-evals-cannot-see-this">Why Single-Turn Evals Cannot See This<a href="https://tianpan.co/blog/2026-06-02-the-conversation-memory-pruning-heuristic-that-erased-the-context-the-next-question-needed#why-single-turn-evals-cannot-see-this" class="hash-link" aria-label="Direct link to Why Single-Turn Evals Cannot See This" title="Direct link to Why Single-Turn Evals Cannot See This" translate="no">​</a></h2>
<p>Your eval suite probably looks like this. You have a dataset of (input, expected output) pairs. You run the model, you score the output against the expected one, you get a number. The dataset is curated to cover the question types the agent should handle. Each row is independent.</p>
<p>That suite cannot catch a pruning regression by construction. The failure only manifests when turn N+k asks a question that depends on context introduced at turn N, and the pruner ran somewhere between them. Single-turn eval rows have no turn N. They have no pruner step. They have no opportunity for the failure to occur.</p>
<p>The fix is a multi-turn eval that explicitly tests against the pruned window. You take a multi-turn conversation from a real session. You let the pruner do its work at each step. Then, at a turn whose answer depends on early context, you replay the question against the pruned window and score the answer. If the answer is wrong because the constraint is gone, that is your regression signal — and it points at the pruner, not the model.</p>
<p>The mechanics matter. Approaches like N+1 evaluation, where you take a conversation up to turn N and evaluate what happens at turn N+1 across many synthetic continuations, give you a population of late-turn questions to score against. User-simulator evals, where another LLM plays the user with a persona and a set of stated constraints, let you generate the test data at the scale the eval needs. Both are now standard in the multi-turn eval literature, but most production teams have not adopted them because the single-turn suite was already passing.</p>
<p>The discipline this asks for is uncomfortable: you have to treat your pruner as a piece of code under test, with its own metric, separate from the model. Most teams treat it as configuration.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-patterns-that-close-the-gap">The Patterns That Close the Gap<a href="https://tianpan.co/blog/2026-06-02-the-conversation-memory-pruning-heuristic-that-erased-the-context-the-next-question-needed#the-patterns-that-close-the-gap" class="hash-link" aria-label="Direct link to The Patterns That Close the Gap" title="Direct link to The Patterns That Close the Gap" translate="no">​</a></h2>
<p>Once you accept that pruning is a quality lever, the design space opens up. Three patterns recur in the production systems that have stopped regressing on this failure mode.</p>
<p><strong>Entity-anchored memory.</strong> When the user states a fact about themselves, a constraint, a preference, or a commitment, that fact is pinned in a separate store keyed by entity and outlives the recency-based pruning of the conversation buffer. "I'm vegetarian" is not a conversational turn; it is a fact about the user, and the system writes it to a user-facts store that the recall step consults on every turn. This is the move that systems like Mem0, Letta, and the temporal-memory layers in the HINDSIGHT architecture all converged on independently. The conversational buffer can prune freely; the entity store is the part the pruner is not allowed to touch.</p>
<p><strong>Per-session memory eval.</strong> For each real session, you snapshot the pre-prune and post-prune windows at every pruning event. You then replay each subsequent question against the post-prune window and score whether the answer would have been the same. The diff between pre-prune and post-prune answer quality is your pruner's regression rate. Run it as a nightly job over the last day's sessions, alert on the rate crossing a threshold, and your pruner is now under observation in production.</p>
<p><strong>Hybrid memory architecture.</strong> The deeper move is to recognize that the conversation buffer mixes two different kinds of state and the pruner is treating them identically. There is short-term conversational state — what we were just talking about, the working set of the current task — and there are long-term commitments — what the user told us about themselves, what we promised them, what they ruled out. These have different lifetimes, different access patterns, and should have different storage. The working memory / long-term memory split in the AgeMem framework and the lifecycle tiers in AMV-L are the same idea: give each kind of state its own store with its own eviction policy, and stop asking one recency heuristic to serve both.</p>
<p>The implementation is less heroic than it sounds. The user-facts store can be a single JSON document per user, updated by a small LLM call after each turn that extracts new facts. The pruner reads from it on every turn and prepends the relevant facts to the prompt. You will spend an afternoon on it. You will see thumbs-up rates climb on long sessions within a week.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-architectural-realization">The Architectural Realization<a href="https://tianpan.co/blog/2026-06-02-the-conversation-memory-pruning-heuristic-that-erased-the-context-the-next-question-needed#the-architectural-realization" class="hash-link" aria-label="Direct link to The Architectural Realization" title="Direct link to The Architectural Realization" translate="no">​</a></h2>
<p>The team that tuned the pruner as a cost lever was answering the wrong question. The right question is not "how few tokens can I fit in the window" but "given a budget, which tokens maximize the probability that the next k turns produce correct answers." That is a retrieval objective, not a compression objective, and it should be optimized with a retrieval system's tools: a labeled dataset of late-turn questions, a metric for late-turn correctness, an embedding or scoring model that picks what to keep, and a regression test that fires when the metric drops.</p>
<p>What the failure mode reveals is the gap between two views of memory. The first view treats memory as a buffer that has to be kept under a size limit; the engineering problem is compression. The second view treats memory as a database that has to serve queries; the engineering problem is retention policy. Production systems that work are the ones that made the shift from the first view to the second.</p>
<p>The dashboard that will tell you which side you are on is not the token-count graph. It is the regression rate of late-turn answers against the pruned window. If you do not have that metric, the pruner is running silently, and the only feedback channel for its mistakes is your customers giving up on the agent partway through a session. By the time that signal reaches you, the regression has been live for weeks. Build the eval first, then tune the pruner against it. The order matters.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="ai-engineering" term="ai-engineering"/>
        <category label="agents" term="agents"/>
        <category label="memory" term="memory"/>
        <category label="context-engineering" term="context-engineering"/>
        <category label="evals" term="evals"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Conversation Summarization That Erased the Consent Flag the User Gave You]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-conversation-summarization-that-erased-the-consent-flag-the-user-gave-you</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-conversation-summarization-that-erased-the-consent-flag-the-user-gave-you"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Compaction preserves what your agent said and forgets what your user chose. Treat conversation memory as two streams — semantic and structured — or ship a privacy violation in the second.]]></summary>
        <content type="html"><![CDATA[<p>At turn 3, your user clicked "do not retain my code." At turn 7, they toggled off "use my conversations to improve the model." At turn 12, they opted out of cross-session memory. At turn 40, your context budget runs out. The compaction pass folds turns 1–30 into a tidy 200-token summary that reads beautifully: it captures what the user asked, what your agent did, and what came of it. At turn 41, your agent — armed with that summary and the most recent ten turns — confidently writes the user's code into a memory store the user opted out of at turn 7.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Conversation%20Summarization%20That%20Erased%20the%20Consent%20Flag%20the%20User%20Gave%20You" alt="" class="img_ev3q"></p>
<p>Your audit log now contains a consent event at t=3, a violating action at t=41, and between them a paragraph of prose that has no field for <em>why</em> the action was permitted. The summarizer was trained to compress conversations, not to forward control state. Nobody told it the consent toggle was load-bearing. Nobody could have, because consent wasn't in the conversation — it was in a structured field next to it, and the structured field didn't survive the trip through summarization.</p>
<p>This isn't a hypothetical. Every team that has shipped a long-running agent with auto-compaction and a privacy surface has this bug latent in their architecture; most haven't tripped it yet because their sessions don't run long enough or their consent toggles haven't been audited against post-compaction actions. The teams who <em>have</em> tripped it usually discover it the way you discover most production privacy bugs: from a regulator's letter or a customer support ticket that starts with "I opted out of this."</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="conversation-memory-is-two-streams-not-one">Conversation Memory Is Two Streams, Not One<a href="https://tianpan.co/blog/2026-06-02-the-conversation-summarization-that-erased-the-consent-flag-the-user-gave-you#conversation-memory-is-two-streams-not-one" class="hash-link" aria-label="Direct link to Conversation Memory Is Two Streams, Not One" title="Direct link to Conversation Memory Is Two Streams, Not One" translate="no">​</a></h2>
<p>A useful mental model: a long-running agent session carries two parallel streams.</p>
<p>The <strong>semantic stream</strong> is the prose — the user's messages, the agent's responses, the tool calls and their results. It's what your summarizer was designed to compress. When you read a post-compaction summary, this is what you see.</p>
<p>The <strong>structured stream</strong> is everything else — consent flags, permission grants, region-of-operation, the user's selected pseudonym, the redaction policy in force, the data-retention class of the session, the regulatory jurisdiction. Some of it the user set explicitly through UI. Some of it came from the auth layer. Some of it was inferred from a tool call ("user invoked the EU-resident-only handler, so this session is GDPR-scoped"). Almost none of it appears in the prose.</p>
<p>A correctly built session keeps both streams synchronized: every action the agent takes is gated by the structured stream and described in the semantic one. A correctly built compaction step preserves both — the prose is summarized and the structured fields are forwarded verbatim.</p>
<p>Most compaction steps preserve only one of the two. The summarizer is an LLM call given a prompt like "summarize this conversation so far, preserving important details." It reads the messages. It does not read the side-band metadata, because the side-band metadata wasn't in its prompt. It produces excellent prose. The structured stream silently disappears, and the agent on the other side of the compaction now has full semantic memory and zero structured state.</p>
<p>This is the failure mode at its plainest: <strong>after compaction, the agent remembers what it was asked and forgets what it was allowed.</strong></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="why-the-test-that-should-catch-this-doesnt">Why the Test That Should Catch This Doesn't<a href="https://tianpan.co/blog/2026-06-02-the-conversation-summarization-that-erased-the-consent-flag-the-user-gave-you#why-the-test-that-should-catch-this-doesnt" class="hash-link" aria-label="Direct link to Why the Test That Should Catch This Doesn't" title="Direct link to Why the Test That Should Catch This Doesn't" translate="no">​</a></h2>
<p>The usual way teams test summarization is to read the summary and judge it: does it capture the conversation? Could a fresh agent pick up where the old one left off? Does the user's intent survive?</p>
<p>These are the right questions for a chatbot. They are the wrong questions for an agent with a privacy surface. The summary can pass all three tests and still be a privacy violation, because the test is evaluating the wrong stream.</p>
<p>A consent flag the user toggled at turn 7 doesn't appear in the prose. It might appear as a system event ("user updated preferences") but the actual state change — <code>retain_code: false</code> — lives in a separate field that the summarizer was never asked to look at. When a reviewer reads the summary and says "yes, this captures the conversation," they are correct. They are also missing the part that matters.</p>
<p>The structural problem is that the metadata you most need to preserve is the metadata that <em>isn't in the conversation</em>. It's adjacent to the conversation. And the people designing the summarizer are usually the AI platform team, who are reasoning about conversation quality. The people who own consent are usually the privacy or legal-engineering team, who are reasoning about audit trails. Neither team is reasoning about the seam between them. The seam is where the bug lives.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="what-a-compaction-step-should-actually-do">What a Compaction Step Should Actually Do<a href="https://tianpan.co/blog/2026-06-02-the-conversation-summarization-that-erased-the-consent-flag-the-user-gave-you#what-a-compaction-step-should-actually-do" class="hash-link" aria-label="Direct link to What a Compaction Step Should Actually Do" title="Direct link to What a Compaction Step Should Actually Do" translate="no">​</a></h2>
<p>A compaction step that respects both streams looks different from a summarization call. It is a multi-stage transformation that treats structured state as a first-class input and output.</p>
<p><strong>Forward every consent flag and policy field verbatim.</strong> The compaction protocol should enumerate the structured fields that exist on the session and copy them across the boundary unchanged. No summarization, no inference, no "let's collapse these three related toggles into one." If the user opted out of code retention, the post-compaction state must contain <code>retain_code: false</code>, byte-for-byte, exactly as it was before compaction. This is the side-band metadata that travels in a channel the summarizer cannot rewrite.</p>
<!-- -->
<div class="loading_VaNF">Loading…</div>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="insider" term="insider"/>
        <category label="ai-engineering" term="ai-engineering"/>
        <category label="agents" term="agents"/>
        <category label="privacy" term="privacy"/>
        <category label="compaction" term="compaction"/>
        <category label="memory" term="memory"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[The Cost Forecast Tied to a Pricing Tier You No Longer Qualify For]]></title>
        <id>https://tianpan.co/blog/2026-06-02-the-cost-forecast-tied-to-a-pricing-tier-you-no-longer-qualify-for</id>
        <link href="https://tianpan.co/blog/2026-06-02-the-cost-forecast-tied-to-a-pricing-tier-you-no-longer-qualify-for"/>
        <updated>2026-06-02T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[A negotiated unit price is not a constant — it is the output of a state machine the vendor runs against your account. When seasonality trips the volume floor, the discount lapses and your forecast quietly goes wrong.]]></summary>
        <content type="html"><![CDATA[<p>The usage curve barely moved. The bill went up 38%.</p>
<p><img decoding="async" loading="lazy" src="https://opengraph-image.blockeden.xyz/api/og-tianpan-co?title=The%20Cost%20Forecast%20Tied%20to%20a%20Pricing%20Tier%20You%20No%20Longer%20Qualify%20For" alt="" class="img_ev3q"></p>
<p>That is the email the finance lead at a mid-sized fintech opened on the first Monday of the quarter. Three months earlier, the engineering org had renegotiated their LLM inference contract and shaved a sizeable percentage off the negotiated unit price by committing to a volume floor. The finance model rolled the new unit price into the FY forecast. Nobody bookmarked the footnote in the pricing schedule that said the discount would lapse if monthly usage fell below the floor for three consecutive months. The seasonal traffic dip in April-May did exactly that. The provider re-tiered the account back to list price. No notification reached engineering, because the notification went to the procurement inbox that nobody had read since the contract was signed.</p>
<p>The forecast was not wrong about how many tokens the product would burn. The forecast was wrong about what those tokens cost, because it assumed a pricing tier that the account no longer qualified for. That distinction — between "we mis-predicted usage" and "we correctly predicted usage at a unit price that no longer applies to us" — is the one most finance models silently get backwards. The unit price is treated as a constant in the spreadsheet, when in fact it is a stateful property of the account that depends on the account's recent behavior.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="vendor-pricing-tiers-are-a-runtime-state-not-a-contract-constant">Vendor pricing tiers are a runtime state, not a contract constant<a href="https://tianpan.co/blog/2026-06-02-the-cost-forecast-tied-to-a-pricing-tier-you-no-longer-qualify-for#vendor-pricing-tiers-are-a-runtime-state-not-a-contract-constant" class="hash-link" aria-label="Direct link to Vendor pricing tiers are a runtime state, not a contract constant" title="Direct link to Vendor pricing tiers are a runtime state, not a contract constant" translate="no">​</a></h2>
<p>When the contract gets handed off from procurement to finance, the negotiated unit price gets transcribed into a cell in a spreadsheet. From that point forward, it behaves like a constant. The forecast multiplies expected tokens by that constant, the budget is sized against the resulting number, and the variance reports compare actual spend against a baseline that assumes the constant still holds.</p>
<p>But the negotiated unit price is not a constant. It is the steady-state value of a function whose inputs include your trailing twelve-month spend, your monthly commitment satisfaction, whether you are within the discounted region of a quantity-break schedule, and in some contracts whether you have hit a specific product mix across embeddings, inference, and batch. The unit price is the output of a small state machine that the provider runs against your account. The state machine has transitions that you can cross without anyone telling you, because the transitions fire on your usage hitting thresholds you set in a document months ago and stopped looking at.</p>
<p>If you wrote production code that read a configuration value from a remote service and never re-read it after startup, you would call that a bug. The cost forecast does exactly this. It reads the unit price once, when the contract is signed, and then treats the value as fixed until the next contract negotiation. The provider's billing system, meanwhile, evaluates the eligibility predicates every billing cycle and emits whatever price comes out. The forecast and reality are guaranteed to diverge the moment the predicates start firing differently — which they will, because the inputs include your own usage and your own usage is not flat.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="the-downgrade-clause-that-lives-in-the-pricing-schedule-footnotes">The downgrade clause that lives in the pricing-schedule footnotes<a href="https://tianpan.co/blog/2026-06-02-the-cost-forecast-tied-to-a-pricing-tier-you-no-longer-qualify-for#the-downgrade-clause-that-lives-in-the-pricing-schedule-footnotes" class="hash-link" aria-label="Direct link to The downgrade clause that lives in the pricing-schedule footnotes" title="Direct link to The downgrade clause that lives in the pricing-schedule footnotes" translate="no">​</a></h2>
<p>Read a typical AI-vendor enterprise pricing schedule and the downgrade clauses are almost never in the main term sheet. They are in the appendix that defines the tier structure, often as a paragraph that begins "Customer's eligibility for the [Tier Name] unit prices shall be maintained for so long as ..." followed by a list of conditions. The conditions are usually written as a floor — a minimum monthly token count, a minimum monthly spend, a minimum number of active workloads. There is then a sub-clause describing what happens when the floor is missed: re-tiering to the next-lowest tier, sometimes with a grace period of one or two months, sometimes without.</p>
<p>The footnote will frequently specify that the re-tiering is automatic, that the provider will provide reasonable notice, and that no refund or true-up is owed for the discounted period preceding the downgrade. That last bit matters. Once you are re-tiered, going back to the discounted rate often requires you to re-qualify on a new trailing window — three or six consecutive months above the floor — not just a single month of recovery. So the bill jump isn't a one-month spike. It's a new baseline that persists until you can prove sustained recovery, and you cannot prove sustained recovery during a quarter you are also trying to plan against the wrong forecast.</p>
<p>The structural problem here is that the people who read this footnote during procurement are not the people who own the forecast. Procurement reads it to red-line risk during negotiation. They successfully removed a few of the worst clauses, accepted the ones that looked manageable, and signed. Finance read the term sheet that procurement summarized for them. Engineering read whatever procurement asked them to validate — usually the rate limit schedule, not the volume-floor schedule. By the time the contract is in force, the only place the downgrade condition is recorded is the PDF in the contract repository, and the only system that consults it on a monthly basis is the vendor's billing engine.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="tier-aware-forecasting-models-the-floor-probability-not-the-floor">Tier-aware forecasting models the floor probability, not the floor<a href="https://tianpan.co/blog/2026-06-02-the-cost-forecast-tied-to-a-pricing-tier-you-no-longer-qualify-for#tier-aware-forecasting-models-the-floor-probability-not-the-floor" class="hash-link" aria-label="Direct link to Tier-aware forecasting models the floor probability, not the floor" title="Direct link to Tier-aware forecasting models the floor probability, not the floor" translate="no">​</a></h2>
<p>The fix at the forecasting layer is not to update the unit price in the spreadsheet whenever it changes. The fix is to stop treating the unit price as a number and start treating it as a function of expected usage relative to the contractual threshold.</p>
<p>The minimum viable version of this is a forecast that produces two numbers per period: the expected token volume, and the probability that the volume falls below the contractual floor. The expected cost is then a piecewise function — discounted unit price multiplied by volume when above the floor, list unit price multiplied by volume when below. The probability-weighted blend of those two regimes is what the forecast should report, not a single point estimate computed from one of them.</p>
<p>For products with strong seasonality, the floor-crossing probability is wildly different across the year. A retail-adjacent workload will sail above the floor in November-December and trip the floor in February-March. A consumer product that runs on a weekly publication cadence will spike on the publication day and droop in the middle of the week. A B2B workload with US-business-hour skew will dip across major holidays and during the slow week between Christmas and New Year. Each of these has a different floor-crossing probability per month, and a forecast that averages them all out produces a confidence interval narrow enough to convince finance that the discounted rate is structural, when in fact it is contingent.</p>
<p>The next layer is to add a "tier transition cost" line to the forecast: how much additional spend gets unlocked the month the downgrade fires. This is the line that makes the issue visible to anyone reading the forecast, because it is a single material number tied to a specific risk. Without that line, the downgrade scenario is buried inside a wider error bar and never gets discussed in the budget review.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="monitor-the-threshold-not-the-absolute-spend">Monitor the threshold, not the absolute spend<a href="https://tianpan.co/blog/2026-06-02-the-cost-forecast-tied-to-a-pricing-tier-you-no-longer-qualify-for#monitor-the-threshold-not-the-absolute-spend" class="hash-link" aria-label="Direct link to Monitor the threshold, not the absolute spend" title="Direct link to Monitor the threshold, not the absolute spend" translate="no">​</a></h2>
<p>The usage-monitoring alert that catches this kind of issue is not the alert your AI infra team probably already has. The team has an alert on "monthly spend exceeds plan by X%." That alert fires after the downgrade has already happened, because the downgrade is the thing that causes the spend to exceed plan. By then it is too late to do anything except eat the higher rate.</p>
<p>The alert that actually helps fires on the leading indicator: monthly token consumption falling below the contractual floor for the first month of the qualifying window. It is a per-tenant or per-product alert, owned by whichever team controls the workload that pushes the account above the threshold. The signal is unambiguous: "you have one month before the discounted rate is at risk and two months before it lapses." The response is to either accelerate eligible workloads onto the contract before the window closes, or to flag the impending re-tier to finance so the forecast can be adjusted before the bill arrives.</p>
<p>The threshold-tied alert has to be encoded somewhere a human-readable. The contract repository is the wrong place, because nobody runs alerts off PDF files. The right place is the same source-of-truth where the rate limits and the SLOs live: a service catalog entry for the vendor account, with the contract's structural parameters — floor, window, transition price — as fields that the monitoring system reads. When the contract renews, the catalog entry changes, and the alerts auto-adjust. When a teammate inspects the entry to debug a cost question, the contract conditions are right there next to the rate limit and the latency target, which is where they belong.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="contract-structure-ratchets-grace-periods-and-quarterly-reviews">Contract structure: ratchets, grace periods, and quarterly reviews<a href="https://tianpan.co/blog/2026-06-02-the-cost-forecast-tied-to-a-pricing-tier-you-no-longer-qualify-for#contract-structure-ratchets-grace-periods-and-quarterly-reviews" class="hash-link" aria-label="Direct link to Contract structure: ratchets, grace periods, and quarterly reviews" title="Direct link to Contract structure: ratchets, grace periods, and quarterly reviews" translate="no">​</a></h2>
<p>The contractual fix is to add a downgrade ratchet to the next renewal. A ratchet is a clause that prevents automatic re-tiering for a defined grace period — often one or two quarters — even if the floor is breached. The provider doesn't love this clause, because it pushes the risk onto them, but it is a standard ask at the enterprise tier and frequently granted in exchange for a marginally higher floor or a longer initial term. The trade is reasonable. You get a buffer against seasonality and incident-driven dips, the provider gets a slightly stronger commitment.</p>
<p>The next ask is a notification clause that names a specific role on your side — not just "Customer" — to receive thirty-day warning of an impending downgrade. The role should be tied to a function, not an individual: "Director of Platform Engineering" rather than "Jane Smith." The vendor's customer-success function can usually deliver this without escalation, because it is in their interest to give you time to course-correct before they lose your business at renewal.</p>
<p>The procurement-side practice that closes the loop is a quarterly contract review co-owned by finance and engineering. The agenda is short: read the trailing-twelve-month usage curve, compare it to the contractual floors and ceilings, identify any tier transitions in the upcoming quarter, and adjust the forecast accordingly. The meeting is dull, which is the point. The contracts are stable, the curves are mostly predictable, and a fifteen-minute review every ninety days catches the kind of slow-burn issue that otherwise gets discovered in a CFO email about a 38% variance.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="treat-the-contract-as-production-infrastructure">Treat the contract as production infrastructure<a href="https://tianpan.co/blog/2026-06-02-the-cost-forecast-tied-to-a-pricing-tier-you-no-longer-qualify-for#treat-the-contract-as-production-infrastructure" class="hash-link" aria-label="Direct link to Treat the contract as production infrastructure" title="Direct link to Treat the contract as production infrastructure" translate="no">​</a></h2>
<p>The leadership reframe is that the procurement contract is a piece of production infrastructure. It is not a one-time legal artifact that gets filed when the deal closes. It is a configuration document whose state drives a cost line on the income statement, whose footnotes are SLOs you have to monitor, and whose renewal cycle is a deploy you have to plan against. The people who treat it that way recover from seasonality without surprises. The people who treat it as a static spreadsheet input keep getting blindsided by bills that are technically correct under terms they technically agreed to.</p>
<p>What this looks like in practice is that the contract has an owner on the engineering side — not just on the procurement side — and that owner is the one who signs off on the cost forecast. The forecast itself encodes the contract's state machine rather than its steady-state value. The monitoring stack alerts on the contractual threshold rather than the resulting spend. The renewal calendar drives a quarterly review whose attendees include the people who own the workloads, not just the people who own the line on the budget. None of this is heroic engineering. It is the cost-engineering version of treating a config file with a deploy schedule as code, which most teams already do for every other piece of production state. The contract is the same kind of object. It is overdue to be governed like one.</p>]]></content>
        <author>
            <name>Tian Pan</name>
            <uri>https://tianpan.co</uri>
        </author>
        <category label="finops" term="finops"/>
        <category label="ai-cost" term="ai-cost"/>
        <category label="vendor-contracts" term="vendor-contracts"/>
        <category label="forecasting" term="forecasting"/>
        <category label="procurement" term="procurement"/>
    </entry>
</feed>