4 posts tagged with "hitl"

The Human Review Queue Is Your P0 SLA: When HITL Becomes the Bottleneck

· 11 min read
Tian Pan
Software Engineer

The first incident is rarely an outage. It's a Slack message from someone in customer success: "Hey, are we OK? Five customers in the last hour escalated tickets that have been sitting in 'awaiting review' for over a day." You check the model latency dashboard. Green. You check the agent's success rate. Green. You check the cost-per-call graph. Healthy. Everything you instrumented is fine. The thing that's broken is a queue your monitoring stack doesn't know exists, staffed by people whose calendars your capacity planner doesn't read, governed by an SLA that nobody has ever written down.

That queue is your human-in-the-loop escalation path. You added it three months ago "for safety" — the agent would defer to a human reviewer on the small fraction of cases where its confidence was low or the action was high-stakes. At launch it caught maybe a dozen items a day. The ops team handled them between other tasks. It was a backstop, not a system. Today it's processing thousands of items, the median time-to-resolution has tripled, and the customers waiting in line are quietly churning. The HITL path didn't fail. It just stopped being treated like production.
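Treating that path like production starts with the most basic instrumentation: how old is the oldest waiting item, and who gets paged when that number crosses a line. A minimal sketch of that check — the queue shape, the 4-hour SLA, and the paging hook are illustrative assumptions, not anything specified in the post:

```typescript
// Hypothetical sketch: treat the review queue like any other production
// dependency -- measure item age and page when the SLA is breached.
interface ReviewItem {
  id: string;
  enqueuedAt: Date; // when the agent deferred to a human
}

const REVIEW_SLA_MS = 4 * 60 * 60 * 1000; // assumed 4-hour review SLA

function itemsBreachingSla(queue: ReviewItem[], now = new Date()): ReviewItem[] {
  return queue.filter((item) => now.getTime() - item.enqueuedAt.getTime() > REVIEW_SLA_MS);
}

function checkReviewQueue(queue: ReviewItem[], page: (msg: string) => void): void {
  const late = itemsBreachingSla(queue);
  if (late.length > 0) {
    // Same severity as a model outage: customers are waiting either way.
    page(`${late.length} review items have been waiting past the 4h SLA`);
  }
}
```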

The Human Attention Budget Is the Constraint Your HITL System Silently Overspends

· 10 min read
Tian Pan
Software Engineer

The 50th decision your reviewer makes this morning is not the same quality as the first. The architecture diagram does not show this. The capacity model does not show this. The dashboard tracking "approvals per hour" actively hides it. And yet the entire premise of your human-in-the-loop system — that a person catches what the model gets wrong — is silently degrading from the moment the queue begins to fill.

Most HITL designs treat reviewer time as an infinite, fungible resource. The team sets a confidence threshold, routes everything below it to a human queue, and declares the system "safe." Six weeks later, the approval rate has crept up to 96%, the queue is twice as deep as the staffing model assumed, and a sample audit shows that reviewers are clicking "approve" on edge cases they would have flagged on day one. The system has not failed. It has rubber-stamped its way into looking like it is working.
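One way to see that degradation instead of hiding it is to stop averaging: bucket decisions by their position in a reviewer's session and watch whether the approval rate climbs through the day. A rough sketch, with the Decision shape and the 25-decision bucket size as assumptions:

```typescript
// Hypothetical sketch: bucket decisions by position in a reviewer's session
// so fatigue shows up as a trend in the approval rate, not a flat average.
interface Decision {
  reviewerId: string;
  positionInSession: number; // 1 = first decision after a break
  approved: boolean;
}

function approvalRateByPosition(decisions: Decision[], bucketSize = 25): Map<number, number> {
  const totals = new Map<number, { approved: number; total: number }>();
  for (const d of decisions) {
    const bucket = Math.floor((d.positionInSession - 1) / bucketSize);
    const entry = totals.get(bucket) ?? { approved: 0, total: 0 };
    entry.total += 1;
    if (d.approved) entry.approved += 1;
    totals.set(bucket, entry);
  }
  // An approval rate that climbs in later buckets is the rubber-stamping signature.
  const rates = new Map<number, number>();
  totals.forEach((v, bucket) => rates.set(bucket, v.approved / v.total));
  return rates;
}
```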

Abstain or Escalate: The Two-Threshold Problem in Confidence-Gated AI

· 13 min read
Tian Pan
Software Engineer

Most production AI features ship with a single confidence threshold. Above the line, the model answers. Below it, the user gets a flat "I'm not sure." That single number is doing two completely different jobs at once, and it's why your trust metric has been sliding for two quarters even though your accuracy on answered queries looks fine.

The right design has at least two cutoffs. An abstain threshold sits low: below it, the model declines because no answer is worth more than silence. An escalate threshold sits in the middle: between the two cutoffs, the system hands the case to a human reviewer instead of dropping it on the floor. Collapse them into a single dial and you ship a product that feels equally useless when it's wrong and when it's uncertain — which is the worst possible position to occupy in a market where users have a free alternative one tab away.
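In code, the two cutoffs are nothing more than a three-way branch; the hard part is choosing the numbers and staffing the middle band. A minimal sketch with illustrative threshold values (not from the post):

```typescript
// Minimal sketch of the two-cutoff design. Threshold values are illustrative;
// in practice they come from the cost of a wrong answer vs. a human review.
type Disposition = "answer" | "escalate" | "abstain";

const ABSTAIN_THRESHOLD = 0.35; // below this, silence beats a guess
const ESCALATE_THRESHOLD = 0.8; // at or above this, the model answers directly

function route(confidence: number): Disposition {
  if (confidence >= ESCALATE_THRESHOLD) return "answer";
  if (confidence >= ABSTAIN_THRESHOLD) return "escalate"; // the middle band goes to a human
  return "abstain";
}
```

With these example values, route(0.6) lands in the human band and route(0.2) declines outright, rather than both collapsing into the same flat refusal.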

This isn't a new idea. The reject-option classifier literature has been arguing for split thresholds since the 1970s, distinguishing ambiguity rejects (the input is between known classes) from distance rejects (the input is far from any training data). Production AI teams keep rediscovering the same lesson the hard way, usually about six months after their first launch, when the support queue is full of people typing "is this thing broken or what."

Human-in-the-Loop Is a Queue, and Queues Have Dynamics

· 11 min read
Tian Pan
Software Engineer

Teams add human approval to an AI workflow the same way they add `if (isDangerous) requireHumanApproval()` to a codebase: as a binary switch, checked once at design time, then forgotten. The metric on the architecture diagram is a green checkmark next to "human oversight." The metric that actually matters — how long the human took, whether they read anything, whether the item was still relevant by the time they clicked approve — rarely has a dashboard.

Treat the human approver as a binary switch and you have built a queue without knowing it. And queues have dynamics: backlog that grows faster than you staff, staleness that makes yesterday's decision meaningless, fatigue that turns review into rubber-stamping, and priority inversion that parks the one decision that mattered behind three hundred that didn't. None of this is visible in the architecture diagram. All of it shows up in the incident retro.
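Two of those dynamics are cheap to check for once you admit the queue exists: backlog divergence (arrivals outpacing reviewer throughput) and staleness (a decision approved after it stopped mattering). A sketch under assumed shapes and rates, not taken from the post:

```typescript
// Hypothetical sketch of two queue dynamics: backlog divergence and staleness.
interface QueuedApproval {
  id: string;
  enqueuedAt: Date;
  priority: number;     // higher = more consequential decision
  staleAfterMs: number; // how long the decision stays meaningful
}

// The backlog grows without bound whenever arrivals outpace reviewer throughput.
function backlogIsDiverging(
  arrivalsPerHour: number,
  reviewers: number,
  reviewsPerReviewerHour: number
): boolean {
  return arrivalsPerHour > reviewers * reviewsPerReviewerHour;
}

// Serve the most consequential fresh item first, instead of letting a human
// approve a decision that stopped mattering while it sat in line.
function nextItem(queue: QueuedApproval[], now = new Date()): QueuedApproval | undefined {
  return queue
    .filter((i) => now.getTime() - i.enqueuedAt.getTime() <= i.staleAfterMs)
    .sort((a, b) => b.priority - a.priority)[0];
}
```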