
The Latency Budget Negotiation: How to Tell Product That 'Real-Time' Costs Capability

11 min read
Tian Pan
Software Engineer

A product manager walks into a planning meeting with a one-line requirement: "responses under two seconds, like ChatGPT." The agent under discussion makes six tool calls, hits two retrieval indexes, runs a reasoning model with a thinking budget, and validates its output with a second-pass critic. End-to-end p50 is currently nine seconds. The engineering team has three options: say yes and quietly degrade the agent into something worse, say no and watch the PM go shopping for a vendor whose demo video promises the moon, or do the thing nobody teaches in onboarding — open a structured negotiation where every second of latency is convertible to a capability the agent gives up.

Most teams pick option one. The agent ships at two seconds, accuracy drops twelve points, the launch is called a success because the headline latency number was met, and three months later the team is fighting a quality regression that nobody can attribute to a single change because the regression was the launch itself. The latency target was never priced. It was inherited from a product spec that treated speed as free.

This post is about how to price it. The conversation that has to happen is not "we can't do that." It is a structured budget negotiation where latency, accuracy, and capability sit on a table together and product picks two. The team that doesn't run that negotiation ships someone else's tradeoff.

Why "Just Make It Faster" Is Architecturally Meaningless

The reason "make it faster" lands as a non-conversation is that latency in agent systems is not a single dial. It is the sum of architectural decisions whose individual costs are knowable but whose total cost only shows up at the end. A single LLM call might be 800ms; an orchestrator-worker flow with a reflection loop is 10–30 seconds. When you tell the engineering team to halve the latency, you are not asking them to optimize — you are asking them to remove components, and the components have names.

Each component buys something specific:

  • A tool call buys grounding in a system the model doesn't know about.
  • A retrieval pass buys access to facts that didn't fit in the context window.
  • A reasoning model buys a different accuracy curve on multi-step problems.
  • A verifier pass buys a guard against a specific failure class — hallucinated citations, wrong units, fabricated APIs.
  • A second model evaluating the first buys robustness on inputs that look adversarial.

When product asks for the agent to be three times faster without renaming the capability set, they are implicitly asking the engineering team to choose which of these to delete and live with the consequences in silence. Engineering is not in a position to make that choice. Product owns the capability surface; engineering owns the architecture that delivers it. Conflating the two is how you get an agent that's fast and wrong.
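To make that concrete, here is an illustrative back-of-the-envelope budget. The component costs below are placeholders standing in for your own trace data, not measurements from any particular stack:

```python
# Illustrative only: these component costs are placeholders, not measurements.
PIPELINE_MS = {
    "plan (reasoning pass, low effort)": 1200,
    "tool calls (6, partially parallelized)": 2600,
    "retrieval (2 passes, warm index)": 700,
    "draft generation": 1800,
    "verifier pass": 900,
    "final synthesis": 800,
}

def budget_report(pipeline_ms: dict[str, int], target_ms: int) -> None:
    """Print end-to-end p50 against the target and the gap that has to be cut."""
    total = sum(pipeline_ms.values())
    print(f"end-to-end p50: {total / 1000:.1f}s (target {target_ms / 1000:.1f}s)")
    if total > target_ms:
        # There is no "optimize" line item worth this much; something on the
        # list has to be deleted, and every entry has a capability attached.
        print(f"over budget by {(total - target_ms) / 1000:.1f}s")

budget_report(PIPELINE_MS, target_ms=2000)
```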

Build the Conversion Table Before the Conversation

The piece of paper that makes this negotiation tractable is a latency-to-capability conversion table built from your actual stack. Not a generic vendor benchmark. Your traces, your tools, your model, your network path.

It looks something like this:

  • One tool call: 600–900ms (depends on the tool — your CRM lookup is slower than your in-memory cache)
  • One retrieval pass: 200–400ms with a warm index, 1.5s on a cold one
  • One reasoning pass at low effort: 1.2s and adds roughly 300 hidden tokens
  • One reasoning pass at high effort: 4–8s and can consume 80% of available output tokens before producing the answer
  • One verifier model invocation: 600–1200ms
  • One model swap from a reasoning model to a fast model on the same query: -3s of latency, -8 to -15 points of accuracy on your eval set
  • One context compaction step: 400ms, saves an average of 1100 tokens downstream

Generate these from production traces, not from a vendor's marketing page. Keep the table in a doc product can read. Update it quarterly because every conversion in it shifts as models, infrastructure, and your tool latencies change. The point of the table is not to be precise — it is to make tradeoffs negotiable in units both sides can agree on.
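If you want to automate the table, the aggregation is small. A minimal sketch, assuming your tracing backend can export spans as records with a component label and a duration in milliseconds — the span schema and component names here are illustrative, not a real tracing API:

```python
from collections import defaultdict
from statistics import median, quantiles

# Hypothetical span records exported from a tracing backend.
spans = [
    {"component": "tool_call.crm_lookup", "duration_ms": 840},
    {"component": "retrieval.warm_index", "duration_ms": 310},
    {"component": "reasoning.low_effort", "duration_ms": 1230},
    {"component": "verifier", "duration_ms": 910},
    # ... thousands more from production traces
]

def conversion_table(spans):
    """Group spans by component and report p50/p95 latency for each."""
    by_component = defaultdict(list)
    for span in spans:
        by_component[span["component"]].append(span["duration_ms"])

    table = {}
    for component, durations in by_component.items():
        p50 = median(durations)
        # quantiles() needs at least two data points; guard small samples.
        p95 = quantiles(durations, n=20)[-1] if len(durations) > 1 else durations[0]
        table[component] = {"p50_ms": round(p50), "p95_ms": round(p95), "n": len(durations)}
    return table

for component, stats in sorted(conversion_table(spans).items()):
    print(f"{component:28s} p50={stats['p50_ms']}ms  p95={stats['p95_ms']}ms  (n={stats['n']})")
```

Run it over a week of production traces and paste the output into the doc; the p95 column is what keeps the negotiation honest about tail latency.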

The first time you put this table in front of a PM, the conversation changes shape. "Make it faster" becomes "okay, which of these am I willing to give up." That is the conversation you wanted to have.

The Pick-Two Frame Forces an Honest Choice

The reframe that closes the negotiation is the pick-two frame: "you can have any two of sub-2s response time, 95% accuracy, and the full tool catalog — pick two and I'll architect to it." This is not a rhetorical move. It is a structural truth about agent systems at the current frontier of cost and capability, and stating it this way moves the conversation off "engineering is being negative" and onto "which axis is the product actually optimizing."

Each pair has a defensible architecture:

  • Sub-2s + full tools, lower accuracy. Use a fast model, parallelize tool calls aggressively, drop the verifier, accept that the bottom decile of inputs will produce wrong answers. Good for use cases where the user can spot a wrong answer and retry — autocomplete, search assist, in-line suggestions.
  • Sub-2s + 95% accuracy, smaller tool catalog. Aggressively prune the toolset to what the agent actually needs for the workflow you're scoping. Cache plans for repeat queries. Use a memory layer that can short-circuit common patterns from 30s of planning to 300ms of retrieval. Accept that the agent will look stupid on edge cases outside its scoped surface.
  • 95% accuracy + full tools, higher latency. Use the reasoning model, run the verifier, keep the rich tool catalog, and let the latency stretch to 8–15 seconds. Stream visible progress so the wait feels like work rather than dead air. Good for workflows where the alternative is a human doing the task in 20 minutes.
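One way to keep the chosen pair explicit in code review is to encode each pair as a named architecture profile the orchestrator loads. Everything below is an illustrative sketch — the model tier names, tool counts, and accuracy figures are placeholder assumptions, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class ArchitectureProfile:
    """An explicit record of which two axes a configuration optimizes."""
    name: str
    model: str                 # illustrative model tier, not a real model name
    max_tools: int | None      # None = full catalog
    verifier: bool
    target_p50_s: float
    expected_accuracy: float   # on your own eval set, not a public benchmark

PROFILES = {
    "fast_full_tools": ArchitectureProfile(
        name="fast_full_tools", model="fast-tier", max_tools=None,
        verifier=False, target_p50_s=2.0, expected_accuracy=0.82),
    "fast_accurate_scoped": ArchitectureProfile(
        name="fast_accurate_scoped", model="fast-tier", max_tools=6,
        verifier=True, target_p50_s=2.0, expected_accuracy=0.95),
    "accurate_full_tools": ArchitectureProfile(
        name="accurate_full_tools", model="reasoning-tier", max_tools=None,
        verifier=True, target_p50_s=12.0, expected_accuracy=0.95),
}
```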

The PM who says "I want all three" is not asking for a product — they are asking for a different industry. The pick-two frame puts that on the table without making it personal. It also forces them to articulate which axis is actually load-bearing for the user. Most of the time, when forced to pick, they pick the axis that wasn't in their original spec.

TTFT Is the Latency Number Product Actually Cares About

A side benefit of running the negotiation explicitly is that it surfaces a measurement error that almost every product spec has buried in it: the latency number that matters to users is rarely end-to-end completion time. It is time-to-first-token.

The two-second target in the original spec was almost certainly TTFT in the PM's head. They had a specific user experience in mind: the user submits a query, something starts happening on screen within a moment, the response feels alive. Whether the full response takes two seconds or twenty matters far less than whether the first character appears within a few hundred milliseconds. Below 100ms, a system feels instantaneous. Below 500ms, it feels responsive. Beyond a couple of seconds with no visible progress, it feels broken — even if the eventual answer is excellent.

This matters for the negotiation because TTFT and end-to-end latency are independently controllable. You can have a nine-second agent that streams its thinking, calls tools visibly, and shows progress at 200ms TTFT — and users will rate that as faster than a four-second agent that returns a finished answer in one block. Streaming is a perceptual technique that buys you most of the user-experience win without paying the architectural cost of removing components.
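The two numbers are also trivially separable in instrumentation. A sketch, assuming a hypothetical streaming agent exposed as an async iterator of output chunks:

```python
import time

async def measure_latencies(agent, query):
    """Report time-to-first-token and end-to-end latency for one streamed response."""
    start = time.monotonic()
    ttft = None
    chunks = []
    # agent.stream() is assumed to yield text chunks as they are produced,
    # including visible tool-call and thinking updates.
    async for chunk in agent.stream(query):
        if ttft is None:
            ttft = time.monotonic() - start   # first visible output
        chunks.append(chunk)
    total = time.monotonic() - start
    return {"ttft_s": ttft, "end_to_end_s": total, "response": "".join(chunks)}

# Usage: asyncio.run(measure_latencies(agent, "where is my order?"))
# A nine-second agent that streams early reports something like
# {"ttft_s": 0.21, "end_to_end_s": 9.3, ...} — the first number is the one users feel.
```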

Before you accept any latency target, ask product what number they actually mean. Half the time the spec collapses into "TTFT under 500ms with visible progress to completion," which is a much cheaper engineering problem than what the spec literally said.

Prove the Quality Cliff With Multi-Tier Evals

The other artifact that makes the negotiation real is a quality-vs-latency curve, graded against a fixed eval set. Not a single score at the current architecture. Multiple scores, one at each architecture you would consider building if the latency target were tightened.

The eval matrix has rows for architectures (full reasoning + verifier, fast model + verifier, fast model + parallel tools, fast model alone) and a column for accuracy on each task type the agent supports. When product sees that going from a four-second to a two-second target costs them eighteen points on the workflow they care about most, the negotiation is over before it starts. They didn't realize the curve was that steep. They imagined a smooth tradeoff.
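The matrix itself is cheap to produce once the eval set exists. A minimal sketch, assuming you already have an `evaluate(architecture, case)` function that returns a pass/fail grade and a measured latency — that function is where all the real work lives:

```python
from statistics import mean, median

ARCHITECTURES = [
    "full_reasoning_plus_verifier",
    "fast_model_plus_verifier",
    "fast_model_parallel_tools",
    "fast_model_alone",
]

def run_matrix(eval_set, evaluate):
    """Score every candidate architecture against the same fixed eval set.

    `evaluate(arch, case)` is assumed to return (passed: bool, latency_s: float).
    """
    matrix = {}
    for arch in ARCHITECTURES:
        results = [evaluate(arch, case) for case in eval_set]
        matrix[arch] = {
            "accuracy": mean(passed for passed, _ in results),
            "p50_latency_s": median(latency for _, latency in results),
        }
    return matrix

# The output is the cliff made visible: accuracy drops in steps at each removed
# component, not smoothly along the latency axis.
```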

This works because product instincts about latency-quality curves are almost always wrong in the same direction: people imagine a gentle slope and the reality is a cliff. The cliff sits at whatever architectural component you have to remove to hit the target. Showing the cliff with numbers is more persuasive than any amount of arguing about it. You are not asking them to take your word — you are showing them the data they would have collected themselves if they had the time to run the eval.

This also creates a useful escape valve for the conversation. When product sees the cliff, the natural follow-up question is "is there a way to push the cliff?" That is a good engineering conversation to have. Memory layers, semantic caching, model distillation, smarter retrieval — these are real levers. They are also long-running engineering investments, and pricing them in the same conversation as a latency negotiation surfaces them as the multi-quarter projects they are, rather than letting them get rolled into a launch sprint as if they were quick wins.

The Failure Mode Is Accepting the Spec

Everything above describes a negotiation. The failure mode worth naming is what happens when no negotiation occurs.

Engineering accepts the latency target as a constraint. They go away and architect to it. The fast model they swap in produces subtly worse outputs on a class of inputs the eval doesn't cover well. The tool the agent stops calling was the one that prevented a category of hallucination nobody had named yet. The verifier they removed was catching a unit-conversion bug once a week. None of this shows up in the launch metrics because the launch metrics are latency and CSAT — and CSAT is high because the agent is fast and confident, and users won't know the answer is wrong until later.

Six months in, the team is fighting a quality regression they cannot localize. Every commit since the launch is a candidate. The actual culprit is the launch itself: an unpriced latency target that traded capability for speed without anyone owning the trade. The PM doesn't remember asking for the model swap. Engineering doesn't remember being asked. The decision happened in the gap between the spec and the architecture, and the gap was a blind spot.

The discipline that closes this gap is small and procedural. Latency targets in product specs require an engineering counterproposal before they become commitments. The counterproposal names the capability tradeoff in the units the conversion table uses. If product accepts the tradeoff, it goes in the spec alongside the latency number — "two seconds, no verifier pass, expected accuracy 84% on the standard eval." If product doesn't accept it, the latency number changes. Either way, both numbers move together, and both are owned.
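The commitment is small enough to live in the spec as data. A hypothetical shape — the field names and values here are made up for illustration:

```python
# Hypothetical shape of a latency commitment recorded alongside the spec.
# Both numbers are owned together: change one and the other gets renegotiated.
LATENCY_COMMITMENT = {
    "workflow": "order_lookup_agent",
    "target_p50_s": 2.0,
    "traded_away": ["verifier_pass"],      # named in conversion-table units
    "expected_accuracy": 0.84,             # on the standard eval set
    "approved_by": ["product", "engineering"],
    "review": "quarterly, alongside the conversion table",
}
```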

Latency Is a Cross-Functional Constraint, Not a Performance Metric

The architectural realization underneath all of this is that latency in agent systems is not a performance metric to be optimized in isolation. It is a constraint that prices every architectural choice downstream — tool calls, model selection, verification depth, retrieval breadth, planning horizon. The team that treats it as a performance metric optimizes a single number and ships an agent whose other numbers got worse without anyone noticing. The team that treats it as a constraint runs the negotiation, builds the conversion table, shows the cliff, and ships an agent whose tradeoffs are explicit.

The conversation is uncomfortable the first few times you hold it. PMs are used to latency being a thing engineering owns. Engineering is used to product specs being requirements rather than starting points. The conversion table feels like overkill until the first time a PM looks at it and says "oh — I didn't realize the verifier was buying us that much." After that conversation, every subsequent negotiation is shorter, because both sides have a shared vocabulary for what a second of latency actually buys.

The agents that ship well in 2026 are not the ones with the lowest latency or the highest accuracy. They are the ones whose tradeoffs were chosen, not inherited. The team that runs the negotiation owns its product. The team that doesn't is delivering somebody else's spec, written with no knowledge of the cliff.
