
Human-in-the-Loop Is a Queue, and Queues Have Dynamics

11 min read
Tian Pan
Software Engineer

Teams add human approval to an AI workflow the same way they add if (isDangerous) requireHumanApproval() to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.

The useful reframe: HITL is not a safety check, it is a service. It has an arrival rate, a service rate, a latency distribution, and a failure mode when arrivals exceed service capacity. Everything that applies to a payment processor or a background job queue applies here, plus a twist — the workers are humans whose throughput degrades in ways a CPU's does not.

Little's Law Applies to Approvers

Little's Law says the average number of items in a stable queue equals the average arrival rate times the average time each item spends in the queue. Applied to an approval workflow: the pending-approvals backlog equals the rate at which AI agents generate approval-requiring actions, multiplied by the average time between submission and decision.

This is not a metaphor. It is the arithmetic. If an agent system generates 1,200 approval-eligible actions per day and the average time-to-decision is four hours, the steady-state backlog is 200 items in-flight at any given moment. Drive arrival rate up by 3x — which is the natural trajectory of any successful agent deployment — and either the backlog explodes or the decision time collapses. There is no third option.
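The arithmetic is short enough to write down. A minimal sketch of the steady-state calculation, using the figures from the paragraph above (function name is mine, not from the text):

```python
def steady_state_backlog(arrivals_per_day: float, avg_decision_hours: float) -> float:
    """Little's Law: L = lambda * W, with lambda and W in the same time unit."""
    arrivals_per_hour = arrivals_per_day / 24
    return arrivals_per_hour * avg_decision_hours

# 1,200 approval-eligible actions/day, 4-hour average time-to-decision:
print(steady_state_backlog(1200, 4))   # 200.0 items in flight

# 3x the arrival rate with unchanged decision time: 3x the backlog.
print(steady_state_backlog(3600, 4))   # 600.0
```

The "no third option" claim falls out of the formula: with L = λW, any growth in λ must show up in L or be paid for by shrinking W.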

Teams hit this and assume the answer is "add more reviewers." Sometimes it is. More often, the arrival rate has outgrown what any reasonable approver headcount can absorb, and the fix is on the arrival side: stronger validators upstream, stricter policies on what actually needs human eyes, a confidence gate that lets the clearly-safe 90% bypass the queue entirely. Staffing up is the instinct; reducing arrivals is almost always the cheaper fix.
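A confidence gate of the kind described above might look like the following sketch. The field names, the threshold value, and the irreversibility flag are all assumptions for illustration; a real deployment would tune the threshold against measured validator calibration:

```python
from dataclasses import dataclass

@dataclass
class Action:
    id: str
    confidence: float   # upstream validator's confidence in [0, 1] (assumed field)
    irreversible: bool  # e.g. deletions, external payments

AUTO_APPROVE_THRESHOLD = 0.97  # illustrative, not a universal constant

def route(action: Action) -> str:
    """Send clearly-safe actions past the queue; everything else to a human."""
    if action.irreversible:
        return "human_queue"       # irreversible actions always get human eyes
    if action.confidence >= AUTO_APPROVE_THRESHOLD:
        return "auto_approve"      # bypass: reduces arrival rate directly
    return "human_queue"
```

The point of the gate is that it attacks λ in Little's Law directly, which no amount of reviewer headcount does.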

The instrumentation teams lack isn't "did the human approve" — it's the queue's vital signs. Arrival rate. Service rate. Current backlog depth. Time-in-queue distribution, not just the average. The moment arrival rate exceeds service rate for a sustained window, the queue is no longer stable and every downstream metric is a lagging indicator of a failure already in motion.
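One way to capture those vital signs is a rolling-window tracker. This is a sketch under assumed names; the stability check (services keeping pace with arrivals over the window) is the heuristic version of the sustained-window condition above:

```python
import time
from collections import deque

class QueueVitals:
    """Rolling vital signs for an approval queue: arrivals, services, depth, latency."""

    def __init__(self, window_s: float = 3600.0):
        self.window_s = window_s
        self.arrivals = deque()   # arrival timestamps
        self.services = deque()   # (decision timestamp, time-in-queue)
        self.in_flight = {}       # item id -> arrival timestamp

    def on_arrival(self, item_id, now=None):
        now = time.time() if now is None else now
        self.arrivals.append(now)
        self.in_flight[item_id] = now

    def on_decision(self, item_id, now=None):
        now = time.time() if now is None else now
        waited = now - self.in_flight.pop(item_id)
        self.services.append((now, waited))

    def _trim(self, now):
        cutoff = now - self.window_s
        while self.arrivals and self.arrivals[0] < cutoff:
            self.arrivals.popleft()
        while self.services and self.services[0][0] < cutoff:
            self.services.popleft()

    def snapshot(self, now=None):
        now = time.time() if now is None else now
        self._trim(now)
        waits = sorted(w for _, w in self.services)
        p95 = waits[int(0.95 * (len(waits) - 1))] if waits else 0.0
        return {
            "arrival_rate_per_h": len(self.arrivals) * 3600 / self.window_s,
            "service_rate_per_h": len(self.services) * 3600 / self.window_s,
            "backlog_depth": len(self.in_flight),
            "p95_wait_s": p95,  # distribution tail, not just the average
            "stable": len(self.services) >= len(self.arrivals),
        }
```

Reporting the p95 wait rather than only the mean matters here: the tail is where staleness and rubber-stamping live.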

The Rubber-Stamp Phase Transition

When a human reviews ten items in an hour, they can actually read them. When they review two hundred, they cannot. The transition between these two regimes is not gradual — it looks more like a phase change, because somewhere along the volume curve the reviewer stops cognitively evaluating each item and starts pattern-matching on superficial features.

The measurable signature is distinctive: average review time per item drops sharply while approval rate stays constant or rises. In a well-calibrated system, approval rate should correlate with queue composition — if the policy pipeline is working, hard cases should produce more rejections and more escalations. When approval rate decouples from queue composition and flattens near 100%, the humans are no longer filtering. They are credentialing.
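Because the signature is measurable, it can be alerted on. A crude detector, assuming per-item review durations and decisions are logged (thresholds are illustrative placeholders, not calibrated values):

```python
def rubber_stamp_signal(review_secs: list[float], approved: list[bool],
                        min_review_s: float = 15.0,
                        approval_ceiling: float = 0.98) -> bool:
    """Flag the rubber-stamp signature: very fast reviews plus near-total approval.

    Either condition alone is ambiguous -- fast reviews may mean an easy batch,
    high approval may mean good upstream filtering. Together they are the
    decoupling described above.
    """
    if not review_secs or len(review_secs) != len(approved):
        return False
    mean_review = sum(review_secs) / len(review_secs)
    approval_rate = sum(approved) / len(approved)
    return mean_review < min_review_s and approval_rate >= approval_ceiling
```

A stricter version would compare approval rate against a per-item difficulty estimate from the policy pipeline, since that is the correlation that should exist in a healthy system.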

Automation bias makes this worse. Empirical studies of AI-assisted decision-making find that humans agree with incorrect AI recommendations at a baseline rate of around 7%, and that this baseline is surprisingly robust to time pressure — reviewers under a deadline don't think harder, they just approve. Couple this with the fact that reviewing AI output is cognitively different from generating a decision from scratch: the anchor is already set, the frame is already drawn, and "approve" is the path of least resistance.

Teams usually notice this late, because the bad metric looks good. Throughput is up. SLA compliance is up. The ML team points at the high approval rate as evidence their models are performing well. Nobody notices that the humans stopped being humans six weeks ago and the approval step has become ceremonial.

Staleness Is the Silent Failure Mode

An approval that comes back two hours after the agent needed it is often worse than no approval at all. The world has moved on. The ticket the agent was responding to has been reassigned. The customer who sent the refund request has already opened a chargeback. The data the agent was about to update has been updated by a different process. The human clicks approve on a snapshot of an intent that no longer exists.

This is a failure mode queue theory doesn't directly name but every real system exhibits: the service rate matters, but so does the time-relevance curve of the items being serviced. In many HITL workflows, the value of a decision decays sharply with latency. After some threshold — often ten minutes for interactive agent actions, an hour for batch workflows — the decision's information content is stale enough that approving it blindly is nearly as good as rejecting it blindly. Both are noise.

The signal to watch is the gap between "human agrees" and "human agrees in time to matter." These are different questions, and conflating them turns the approval queue into a liability amplifier. Worse: staleness correlates with queue depth, which correlates with the rubber-stamp phase. The deepest part of the backlog is precisely where reviewers are most fatigued, decisions matter least, and the items most likely to be approved without reading.

A useful architectural move is to set a freshness deadline per item and treat the decision differently when the item ages past it. Some teams auto-reject stale items on the theory that a stale reject is safer than a stale approve. Others pull them off the human queue entirely and route them back to the agent with a policy signal that the approval window was missed, letting the agent decide whether to retry or abandon. Both beat silently clearing a backlog of decisions that have already expired.
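Both policies can hang off a single dispatch function. A sketch, using the freshness windows mentioned above as illustrative defaults (the names and the two-policy flag are mine):

```python
from enum import Enum

class Disposition(Enum):
    HUMAN_REVIEW = "human_review"
    AUTO_REJECT = "auto_reject"          # policy A: a stale reject beats a stale approve
    RETURN_TO_AGENT = "return_to_agent"  # policy B: agent decides to retry or abandon

# Illustrative freshness deadlines from the text, in seconds.
FRESHNESS_S = {"interactive": 600, "batch": 3600}

def dispose(kind: str, age_s: float, return_stale_to_agent: bool = True) -> Disposition:
    """Route an item by age: fresh items go to a human, stale ones exit the queue."""
    deadline = FRESHNESS_S.get(kind, 600)
    if age_s <= deadline:
        return Disposition.HUMAN_REVIEW
    if return_stale_to_agent:
        return Disposition.RETURN_TO_AGENT
    return Disposition.AUTO_REJECT
```

Either way, the key property is that nothing past its deadline is silently presented to a human as if it were still actionable.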

Priority Inversion in the Approval Queue

Naive approval queues are FIFO. This is wrong in almost every case, and for a specific reason from distributed systems: FIFO queues produce priority inversion. Your high-stakes item — the payment exceeding the policy threshold, the customer-facing action from a VIP account, the data deletion that cannot be undone — sits behind three hundred trivial approvals because it happened to arrive after them.
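The standard fix is the same one distributed systems use: replace FIFO with a priority queue, with a tie-breaker so items at the same priority level still serve in arrival order. A minimal sketch (class and priority scheme are illustrative):

```python
import heapq
import itertools

class ApprovalQueue:
    """Priority queue for approvals: lower number = higher priority.

    The monotonic counter breaks ties, preserving FIFO order
    within a single priority level.
    """

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, item, priority: int):
        heapq.heappush(self._heap, (priority, next(self._counter), item))

    def next_for_review(self):
        return heapq.heappop(self._heap)[2]

q = ApprovalQueue()
for i in range(300):
    q.submit(f"trivial-{i}", priority=5)
q.submit("irreversible-deletion", priority=0)  # arrives last, reviewed first
```

With this structure, the irreversible deletion jumps the three hundred trivial approvals instead of waiting behind them.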
