
The Contestability Gap: Engineering AI Decisions Your Users Can Actually Appeal

11 min read
Tian Pan
Software Engineer

A user opens a chat, asks for a refund, gets "I'm sorry, this purchase is not eligible for a refund," closes the tab, and never comes back. Internally, the agent emitted a beautiful trace: tool calls, intermediate reasoning, the policy bundle it consulted, the model version it ran on. Every span landed in the observability platform. None of it landed anywhere the user could reach. There is no button labeled "ask a human to look at this again," and even if there were, there is no service behind it. The decision is final by default, not by design.

This is the contestability gap, and it is the next thing regulators, lawyers, and angry users are going to rip open. It is also one of the cleanest examples of a problem that looks like policy from the outside and turns out to be plumbing on the inside.

The technical reason the gap exists is that production AI pipelines were optimized for the forward path. The agent reads a request, fetches some context, picks a tool, generates an output, and returns. The reasoning trace exists, but it was logged for the on-call engineer, not for the user. The input snapshot the model actually saw lives in one store; the policy bundle that gated the decision lives in another; the model version is a tag on the deployment, not a field on the decision. Asking "why was this user denied?" three weeks later usually means joining four logs and hoping the retention windows lined up. Asking "re-evaluate this case under different assumptions" almost always means running the same prompt against the same context and getting, predictably, the same answer.

What "Final" Looks Like When It Wasn't Supposed To Be

Walk the user-visible decision boundary in any AI-mediated product and you will find a long list of outputs that the team would, on reflection, classify as appealable — and a short list of UI affordances that actually let the user appeal. An agent declines a refund. A moderation pipeline removes a post. A content ranker buries a creator's video. An identity service flags an account as suspicious and forces a re-verification loop. A hiring tool quietly down-scores a resume. A recommender stops showing a merchant's products to the buyers who used to buy them.

Every one of those is "the model said so." Every one of them is also a decision the regulator now wants you to be able to explain and, in many cases, allow the user to contest. The EU AI Act's right to explanation under Article 86 entitles people affected by high-risk AI decisions to "clear and meaningful explanations of the role of the AI system in the decision-making procedure," and GDPR Article 22 has long required, for solely automated decisions with legal or similarly significant effects, that data subjects be able to obtain human intervention, express their point of view, and contest the decision. The wording is older than the latest model generation, but the obligation is unchanged: a real path back to a human, with a real chance of a different outcome.

Engineering teams tend to discover this requirement in the wrong order. First, they ship the agent. Then someone asks "what's the appeal path?" Then someone realizes there isn't one. Then someone proposes "we'll route to support." Then support points out that they have no input snapshot, no policy version, no record of what the agent actually saw — only a customer transcript that says "AI said no" and a confused human asked to overrule "the system" without knowing what "the system" decided. That conversation usually ends with the appeal getting upheld for the wrong reason or denied for the wrong reason, neither of which is contestability. It is just a coin flip with a friendlier voice.

The Three Things You Need Before You Need Them

Contestability is not a feature you can bolt onto a launched agent in an afternoon. It is three pieces of infrastructure, and skipping any one of them turns the appeal flow into theater.

The first is a per-decision durable record. Not a span, not a log line — a record. For every decision that crosses the threshold of "could affect a user's interests," you need a row that captures the full input snapshot the model saw (canonicalized so two runs on identical inputs hash identically), the model version and provider, the policy bundle or rule version that gated the output, the tool calls that were attempted and their results, and the final output as it was returned to the user. This record needs its own retention policy, decoupled from your hot observability storage. Twelve months is not enough; three to seven years is closer to where regulators are landing for high-risk decisions, and the AI Act's high-risk system audit trail expectations push that number up further. This record is the thing the auditor asks for when they show up. It is also the thing your second-look pipeline reads from when a user appeals.
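
Concretely, the record is just a wide row keyed by decision id. Here is a minimal sketch in Python; the field names, the canonicalization scheme, and the seven-year default are illustrative choices, not a standard schema:

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


def canonical_hash(snapshot: dict) -> str:
    # Sorted keys, no whitespace: two runs over byte-identical inputs hash identically.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class DecisionRecord:
    decision_id: str                 # stable id handed back to the application
    input_snapshot: dict             # everything the model actually saw
    input_hash: str                  # canonical_hash(input_snapshot)
    model_version: str               # e.g. provider/model@revision
    policy_bundle_version: str       # the rule set that gated the output
    tool_calls: list[dict]           # attempted calls and their results
    final_output: str                # exactly what was returned to the user
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    retention_years: int = 7         # decoupled from hot observability storage
```

Canonicalizing before hashing is what makes the record useful later: if a re-run over "the same" inputs produces a different hash, the context changed, and the appeal reviewer needs to know that.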

The second is a user-facing appeal endpoint with an SLA. Not a contact form that lands in a help-desk queue with no decision identifier; a real endpoint, with a real schema, where a user (or their support agent on their behalf) can submit "I'd like decision <id> reviewed" along with new context the original decision didn't have. The endpoint creates an appeal case, links it to the durable decision record, and starts a clock. The clock matters. An appeal that has no SLA is an appeal that quietly dies in a backlog, and "we'll get back to you when we can" is the same outcome as "no" for any user who needs a fast resolution. Cove's appeal API is one example of this shape in the moderation space; the same idea generalizes to any class of decision your users care about. A POST to your appeal endpoint creates a case; a POST back to your callback (or your support tool) records the resolution; both are linkable to the original decision id forever.
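
The intake side could look like the sketch below, assuming FastAPI; the route, the request fields, the in-memory stores, and the 72-hour SLA are placeholders for whatever your product actually promises:

```python
from datetime import datetime, timedelta, timezone
from uuid import uuid4

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

APPEAL_SLA = timedelta(hours=72)      # the clock that keeps appeals out of the backlog

# Stand-ins for the durable decision store and the appeal case store.
decision_records: dict[str, dict] = {}
appeal_cases: dict[str, dict] = {}


class AppealRequest(BaseModel):
    decision_id: str                  # links the case to the durable decision record
    user_id: str
    new_context: str = ""             # anything the original decision did not have


@app.post("/appeals")
def create_appeal(req: AppealRequest) -> dict:
    if req.decision_id not in decision_records:
        raise HTTPException(status_code=404, detail="unknown decision id")
    now = datetime.now(timezone.utc)
    case = {
        "appeal_id": str(uuid4()),
        "decision_id": req.decision_id,
        "user_id": req.user_id,
        "new_context": req.new_context,
        "opened_at": now.isoformat(),
        "sla_deadline": (now + APPEAL_SLA).isoformat(),
        "status": "open",
    }
    appeal_cases[case["appeal_id"]] = case
    return {"appeal_id": case["appeal_id"], "sla_deadline": case["sla_deadline"]}
```

Storing the deadline on the case itself is the point: a background job can then page the reviewer queue when it slips, rather than trusting anyone to remember.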

The third is a second-look pipeline that is not just the first pipeline run again. This is where teams almost always cut a corner that ruins the whole structure. If the appeal handler simply re-invokes the original agent with the original context, the model, deterministic at temperature 0 and strongly biased toward the same verdict at any other setting, produces the same answer, and the user gets a politely worded "we have reviewed your case and confirmed our decision" that no human ever actually reviewed. Meaningful human review, the term GDPR enforcers and the ICO have been hammering on for years, is human review with the authority and the capability to overturn the decision. To make that possible, the second-look pipeline needs to do at least one of the following: run a different model with a longer context window and a different system prompt, escalate directly to a queue of human reviewers with the case record attached, or apply a more permissive policy bundle that exists specifically for the appeal path. Often, all three.
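
A dispatcher for that escalation might look like this sketch, with stub functions standing in for the reviewer queue and the re-evaluation call; the decision classes, queue name, model id, and policy bundle id are all placeholders:

```python
def enqueue_for_human_review(case: dict, record: dict, queue: str) -> dict:
    # Stand-in: push the appeal case plus the full decision record onto a reviewer queue.
    return {"route": "human", "queue": queue, "appeal_id": case["appeal_id"]}


def reevaluate(record: dict, new_context: str, model: str, policy_bundle: str) -> dict:
    # Stand-in: re-run with a different model, a longer context, and the
    # appeal-specific policy bundle, never the original prompt verbatim.
    return {"route": "model", "model": model, "policy_bundle": policy_bundle}


def run_second_look(case: dict, record: dict) -> dict:
    decision_class = record.get("decision_class", "unknown")

    if decision_class in {"account_suspension", "identity_flag"}:
        # Highest-impact classes go straight to a human, with the record attached.
        return enqueue_for_human_review(case, record, queue="trust-and-safety")

    return reevaluate(
        record=record,
        new_context=case.get("new_context", ""),
        model="reviewer-model-large",            # not the model that decided originally
        policy_bundle="appeals-permissive-v1",   # exists only for the appeal path
    )
```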

The Decision Classification Problem

Not every model output deserves an appeal endpoint. The user who asked the agent to summarize a PDF and got a bad summary does not need a contestability record retained for seven years. The user who asked for a refund and got denied does. Somewhere between those two there is a line, and someone on your team needs to draw it explicitly, or your durable-record store will fill up with chat-summary noise and your appeal queue will fill up with users asking the help desk to fix the bullet points.

A useful starting framework is to classify decisions by the reversibility and the user impact of being wrong. Decisions that are low-impact and trivially reversible by the user themselves (regenerate the summary, ask the agent again with different phrasing) are final by design and need only the standard observability trace. Decisions that are higher-impact but still reversible by the user (the agent picked a wrong document, the agent suggested the wrong product) usually need a transparent path to override, but not a formal appeal pipeline. Decisions that are higher-impact and not user-reversible — refusals, removals, denials, downranks, suspensions — are where contestability infrastructure earns its keep, and where the regulator's interest is sharpest.
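
Written down, the framework is small enough to be a lookup the gateway can evaluate per request; the class names and the two boolean axes here are illustrative, not a standard taxonomy:

```python
from enum import Enum


class DecisionClass(Enum):
    FINAL_BY_DESIGN = "final"        # low impact, the user can simply retry
    OVERRIDE_PATH = "override"       # higher impact, user can correct it in-product
    CONTESTABLE = "contestable"      # higher impact, not user-reversible


def classify(high_impact: bool, user_reversible: bool) -> DecisionClass:
    if not high_impact and user_reversible:
        return DecisionClass.FINAL_BY_DESIGN
    if user_reversible:
        return DecisionClass.OVERRIDE_PATH
    # Anything the user cannot reverse themselves defaults to the contestable class.
    return DecisionClass.CONTESTABLE
```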

Drawing this line is a product decision, not an engineering decision, but the engineering team is the one that has to enforce it. The cleanest enforcement is at the gateway layer: the same place that already does cost attribution, rate limiting, and provider routing should also classify the decision, write the durable record, and emit the decision id back to the application. Apps that handle high-stakes outcomes get a contestability story for free; apps that don't, don't pay the cost. This is one of the better arguments for centralizing the LLM gateway in any organization that runs more than two or three AI features in production.
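
At the gateway the hook is modest. A sketch follows, with stand-in functions for the provider call and the durable-record write, and assuming the decision class arrives as request metadata from the application:

```python
from uuid import uuid4


def call_provider(snapshot: dict) -> str:
    # Stand-in for the forward-path call the gateway already makes.
    return "decision text"


def write_durable_record(**fields) -> None:
    # Stand-in for the write into the long-retention decision store.
    pass


def handle_llm_call(app_id: str, snapshot: dict, decision_class: str) -> dict:
    output = call_provider(snapshot)

    decision_id = None
    if decision_class == "contestable":
        # Only high-stakes outcomes pay the durable-record and retention cost.
        decision_id = str(uuid4())
        write_durable_record(
            decision_id=decision_id,
            app_id=app_id,
            input_snapshot=snapshot,
            final_output=output,
        )

    # The decision id travels back to the application so the UI can attach
    # "request human review" to this specific output.
    return {"output": output, "decision_id": decision_id}
```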

The Org Failure Mode Behind the Technical One

The reason the contestability gap is hard to close is not that the architecture is novel. It isn't — financial services and healthcare have been writing audit-trailed adjudication systems for decades. The reason is organizational. The team that ships the agent sits inside a product org with goals about deflection, automation rate, and cost-per-resolution. The team that would handle appeals sits inside support, trust and safety, or compliance, with goals about resolution time and accuracy. Nobody owns the seam. The agent team's roadmap doesn't include "build an appeal pipeline for someone else's team to operate," and the support team's roadmap doesn't include "build the data plumbing the agent team should have shipped."

The first regulator subpoena, the first class action, or the first viral Twitter thread from a user who could not find a human makes it everyone's problem at the same time, on a timeline measured in weeks. The teams that have weathered this without firefighting are the ones that did three things early. They named a single owner for "the appeal path" — usually a small platform team that owns the gateway, the durable record store, and the appeal endpoint — and made that team responsible for the contract with both the agent team and the reviewer team. They wrote the decision classification policy down before the first launch, not after the first incident. And they treated the durable record schema as a versioned, breaking-change-reviewed artifact, the same way they treat their public API, because every downstream consumer (legal, audit, the second-look pipeline, the analyst writing fairness reports) breaks if the schema drifts silently.

What To Do Before The First Subpoena

If you ship AI-mediated decisions in production today, the fastest way to find your contestability gap is to walk the boundary yourself. Pick a recent denial — a refund, a moderation removal, a verification failure — and try to answer four questions. What input did the model actually see, byte-for-byte? Which model version and policy bundle produced the output? Where, in the user-facing product, can the user request a human review of this specific decision? When that request comes in, what pipeline runs, and is it different enough from the original to plausibly produce a different answer?

If you can answer all four cleanly for any decision more than 24 hours old, you are ahead of most teams shipping in 2026. If you cannot — and most teams cannot — you have just identified the work. Build the durable record first; without it, every other piece is decorative. Build the appeal endpoint second, even if the only consumer for the first quarter is your support team's internal tool. Build the differentiated second-look pipeline last, after you have enough appeal volume to know whether routing to a different model, a different policy, or a human reviewer is the right escalation for each decision class.

The forward path of an AI system is a small fraction of the work. The reverse path — explainable, contestable, and overturnable by a human with real authority — is where the next two years of platform engineering, and most of the next two years of regulatory pressure, are going to land.
