The Support Ticket to Eval Case Pipeline Nobody Builds
Every team running an AI feature in production is sitting on the highest-signal eval dataset they will ever have, and they are not using it. The dataset is in Zendesk. Or Intercom. Or Freshdesk, or Help Scout, or whatever queue the support team lives inside. The tickets that get filed there describe the exact failure modes the model produced in front of a paying customer — wrong tone, wrong tool call, wrong policy, hallucinated capability, leaked context. Each one is a labeled negative example, hand-written by the user who experienced the failure, often with reproduction steps and a sentiment annotation attached for free.
The eval suite, meanwhile, lives in Git. It was hand-written by whichever engineer set it up six months ago, and it has accumulated maybe fifty cases since. The intersection between "things the eval suite covers" and "things that actually break in production" is a Venn diagram with a thin sliver of overlap and two large, mutually ignorant lobes.
The thing that fills the gap, when it gets filled at all, is an afternoon — sometimes a full day — once a quarter where someone exports tickets to a CSV, opens them in a spreadsheet, manually rewrites them into eval cases, and pastes the result into a test file. It works. It's also a heroic one-time effort that decays the moment it ships, and the team that did it once almost never does it twice on the same cadence.
Why the gap is structural, not lazy
The reason this pipeline doesn't exist isn't that nobody noticed the signal. Every AI engineer who has read a Hamel Husain post knows that user feedback signals — negative ratings, support tickets, escalations — are the data they should be mining. The reason it doesn't exist is that the two systems are owned by two different teams with two different vocabularies.
Support owns Zendesk. They categorize tickets with their own taxonomy ("billing," "account access," "shipping"), are staffed for response time, and are measured on CSAT. They have no concept of an "eval case" and no reason to learn one. Engineering owns the eval suite. They categorize cases by failure mode ("missed citation," "wrong tool," "schema violation"), are staffed for model migrations, and are measured on regression coverage. They have no access to Zendesk's API beyond a read-only Slack notification, and no time to learn a customer-support data model that wasn't designed for them.
The cross-team translation work — reading a ticket, deciding whether it represents a model failure or a product failure or a misuse failure, extracting the input that triggered it, normalizing it into the eval schema — is the work that nobody is staffed to do. So it gets done in afternoons, by whoever has the energy, and it gets done with the cases that happened to be top-of-mind that week. The other 90% of tickets are signal that decays in place.
A Salesforce CRMArena benchmark released in late 2025 found that LLM agents resolved only 35% of multi-turn customer support tasks, even though they handled 58% of single-turn ones. The failures concentrated in the exact patterns support tickets capture best: information mentioned early in a conversation that the agent forgot by the time it had to act on it, secure actions like refunds and escalations where policy constraints mattered, and the long tail of ways customers phrase the same underlying need. These aren't failures the engineering team will ever discover by reading their own dogfooded transcripts. They surface in tickets, and only in tickets.
What a standing pipeline actually requires
A real support-to-eval pipeline isn't a job scheduler that copies tickets into a JSON file. It's a four-stage discipline, and each stage has to be owned by a named person or it doesn't run.
The first stage is ingestion with provenance. Tickets land in a staging table — not the eval suite itself — with their full original payload preserved: ticket ID, timestamp, user message, agent response if there was one, the trace ID of the model call if your system emits one, and the support agent's free-text resolution note. Provenance matters because two months later, when someone looks at an eval case that's been failing for a week, the only way to know whether it's still relevant is to follow the chain back to the original ticket and ask whether the underlying product behavior has changed.
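As a sketch, assuming a Python ingestion job, the staging record might look like the dataclass below. The field names are illustrative rather than any standard schema; the point is that the full original payload and the provenance chain survive ingestion untouched.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# Hypothetical staging record; field names are illustrative, not a
# standard Zendesk (or Intercom, etc.) schema.
@dataclass
class StagedTicket:
    ticket_id: str                     # e.g. the Zendesk ticket ID
    created_at: datetime
    user_message: str                  # verbatim customer text
    agent_response: Optional[str]      # model output, if one was sent
    trace_id: Optional[str]            # model-call trace, if your stack emits one
    resolution_note: Optional[str]     # support agent's free-text resolution
    raw_payload: dict = field(default_factory=dict)  # full original ticket JSON
```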
The second stage is triage and classification. Not every ticket is an eval case. The ones that are have to be separated from the ones that aren't, and the ones that are have to be tagged with a failure mode. The teams that get this right run a lightweight LLM-based classifier — itself an evaluated piece of infrastructure — that proposes a tag and a confidence score, plus a queue where a human reviews everything below a threshold. Doing it purely manually doesn't scale past about 200 tickets a week. Doing it purely automatically silently rewrites your eval set with the model's own blind spots. The hybrid is the only thing that works.
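A minimal sketch of that hybrid routing, reusing the StagedTicket record above. The classifier call is a stub for a prompted model call, and both the threshold and the tag set are illustrative:

```python
REVIEW_THRESHOLD = 0.8  # illustrative; calibrate against human-agreement data

FAILURE_MODES = {"missed_citation", "wrong_tool", "schema_violation",
                 "policy_violation", "not_a_model_failure"}

def classify_ticket(ticket: StagedTicket) -> tuple[str, float]:
    """Propose a (failure_mode, confidence) pair for one ticket.
    In practice this is a prompted LLM call, itself an evaluated
    piece of infrastructure, not a fire-and-forget script."""
    raise NotImplementedError("wire up your model provider here")

def triage(ticket: StagedTicket, review_queue: list, eval_candidates: list) -> None:
    tag, confidence = classify_ticket(ticket)
    if tag not in FAILURE_MODES or confidence < REVIEW_THRESHOLD:
        review_queue.append((ticket, tag, confidence))  # a human reviews these
    elif tag != "not_a_model_failure":
        eval_candidates.append((ticket, tag))           # auto-accepted above threshold
```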
The third stage is normalization into the eval schema. Support tickets are written in customer prose. They reference internal feature names that have since been renamed. They include screenshots that aren't reproducible. They sometimes describe a failure that happened across multiple sessions. Turning a ticket into an eval case means extracting the minimal input that reproduces the failure, the expected behavior (often by reading the support agent's resolution), and the assertion that will catch it. This is the step nobody wants to do, because it's slow, judgment-heavy, and easy to get wrong. It's also the step that determines whether the resulting eval case is worth running.
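One plausible target schema, with field names that are assumptions rather than any standard. The source_ticket_id backward link is what the fourth stage depends on, and the feature tag earns its keep later, during retirement:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    source_ticket_id: str      # backward link to the staging table
    failure_mode: str          # e.g. "wrong_tool"
    feature: str               # product-feature tag, used by retirement later
    input: str | list[dict]    # minimal reproducing input: a string, or a
                               # list of chat turns for multi-turn cases
    expected_behavior: str     # usually distilled from the resolution note
    assertion: str             # scoring strategy: "exact", "contains",
                               # "llm_judge", a schema check, etc.
```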
The fourth stage is coupling to the regression suite. A new eval case is worth nothing if nobody notices when it breaks. The cases mined from tickets have to flow into the same regression run that gates a prompt change or a model migration. They have to be tagged with the failure mode they guard, so that when an eval fails, the triage path is "look at the failure-mode tag, find the source ticket, decide whether the regression is real." Cases without that backward link become alert fatigue within a month.
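Coupling can be as plain as parameterizing the existing pytest run over the mined cases. In the sketch below, run_model and passes_assertion are stand-ins for whatever inference entry point and scoring function the suite already has, and the file path is invented:

```python
import json
import pathlib

import pytest

# Load the mined cases; path and file format are assumptions.
CASES = [EvalCase(**row)
         for row in json.loads(pathlib.Path("evals/ticket_mined.json").read_text())]

@pytest.mark.parametrize("case", CASES, ids=lambda c: c.case_id)
def test_ticket_mined_case(case):
    output = run_model(case.input)        # your inference entry point
    ok = passes_assertion(output, case)   # your scoring function
    # On failure, the message is the triage path: tag first, ticket second.
    assert ok, (f"[{case.failure_mode}] regression; "
                f"source ticket {case.source_ticket_id}")
```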
The failure modes you only see in tickets
The reason this discipline is worth standing up — rather than continuing to rely on synthetic eval cases written by engineers — is that the failure modes that show up in tickets are the ones engineers cannot imagine. Three patterns recur across teams that have done this work seriously.
The first is policy-shaped failures. A customer asks something the model is allowed to answer in one jurisdiction and not another. Or the model offers a refund that the company's policy says requires human approval. Engineers don't write eval cases for these because they don't know the policies — the policies live in the support team's heads and in the resolution macros they apply. Tickets surface them because the policy violation is what triggered the ticket.
The second is out-of-distribution phrasing. Engineers write eval cases in engineer prose: clear, structured, often a paraphrase of the product docs. Customers write in customer prose: ambiguous, mixed-intent, peppered with typos and product-internal jargon they half-remember. That variance in phrasing is what production exposes and what dogfooded evals never will. Tickets are a free corpus of in-the-wild phrasing.
The third is multi-turn drift. Single-turn evals dominate hand-written suites because they're easy to write and easy to score. Real failures are usually multi-turn — the customer gave the agent a key constraint in message two, and the agent had forgotten it by message five. Tickets capture the full transcript when the customer pastes it, which they do more often than engineers expect. The teams that mine these get a multi-turn eval set as a side effect.
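As an illustration, a mined multi-turn case might normalize into the EvalCase schema above like this; the transcript and every field value are invented for the example:

```python
multi_turn_case = EvalCase(
    case_id="zd-48213-direct-flight",
    source_ticket_id="48213",
    failure_mode="multi_turn_drift",
    feature="flight_changes",
    input=[
        {"role": "user", "content": "I need to change my flight."},
        {"role": "assistant", "content": "Sure. What's your booking reference?"},
        {"role": "user", "content": "ABC123. It has to stay a direct flight."},
        {"role": "assistant", "content": "Got it, looking at options now."},
        {"role": "user", "content": "Whatever's cheapest on the 14th."},
    ],
    # The constraint lands in the second user message; the original
    # failure was the agent forgetting it three turns later.
    expected_behavior="Only direct flights are offered; the constraint from "
                      "the second user message is still honored at the end.",
    assertion="llm_judge",
)
```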
Where the pipeline tends to die
Three places, predictably. The first is classifier drift. The LLM-based triage classifier was built against last quarter's failure modes. Six months later, the model has been upgraded, the product has shipped two new features, and the classifier is silently misrouting the new failure modes into the "not relevant" bucket. The fix is to put the classifier itself under eval: sample 50 tickets a month, have a human re-classify them, and watch the agreement rate. If agreement drops below a threshold, retrain or rewrite the classifier.
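A sketch of that monthly check, where human_label stands in for whatever review UI produces the ground-truth tag and the agreement floor is illustrative:

```python
import random

AGREEMENT_FLOOR = 0.85  # illustrative; pick a floor you would actually act on

def monthly_agreement_check(recent_tickets, human_label, n=50):
    """Sample n tickets, compare the classifier's tag to a human's."""
    sample = random.sample(recent_tickets, min(n, len(recent_tickets)))
    if not sample:
        return 1.0  # nothing to check this month
    agree = sum(1 for t in sample if classify_ticket(t)[0] == human_label(t))
    rate = agree / len(sample)
    if rate < AGREEMENT_FLOOR:
        # Stand-in for whatever pages the pipeline owner.
        print(f"ALERT: triage agreement at {rate:.0%}; retrain or rewrite")
    return rate
```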
The second is eval-set bloat. Once the pipeline starts running, the eval set grows linearly with ticket volume. Within a year, a team that started with fifty cases is running two thousand, the regression run takes forty minutes, and engineers start skipping it locally. The discipline is a retirement policy: cases that have been green for ninety days and have a duplicate in the suite get archived. Cases tied to features that have since been removed get deleted. The eval set is not a museum.
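That retirement policy reduces to a prune pass like the sketch below, assuming each case carries a feature tag and the run history can report its most recent red date:

```python
from datetime import datetime, timedelta, timezone

GREEN_WINDOW = timedelta(days=90)

def prune(cases, last_red_at, removed_features, has_duplicate):
    """Monthly retirement pass (sketch).
    last_red_at: case_id -> datetime of the most recent failure, or None.
    has_duplicate: predicate saying another case covers the same behavior."""
    now = datetime.now(timezone.utc)
    kept, archived, deleted = [], [], []
    for case in cases:
        if case.feature in removed_features:
            deleted.append(case)        # feature is gone: delete outright
            continue
        last_red = last_red_at.get(case.case_id)
        green_long_enough = last_red is None or now - last_red > GREEN_WINDOW
        if green_long_enough and has_duplicate(case):
            archived.append(case)       # green 90 days with a duplicate: archive
        else:
            kept.append(case)
    return kept, archived, deleted
```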
The third is support-team buy-in. The pipeline depends on support agents writing resolution notes that are actually useful for normalization — naming the failure mode, describing the expected behavior, linking to the relevant macro. If the engineering team builds the pipeline without involving support, the resolution notes will continue to read "Resolved per macro 47" and the normalization stage will stall. The solution is a small change to the support workflow: a structured field on tickets tagged "AI-related" that captures the expected behavior in one sentence. It's a five-minute change for the support manager and a 10x quality lift for the eval pipeline.
The shape of the discipline that actually lands
Teams that get this working in 2026 share a small set of habits. They have one engineer (not a team, one engineer) whose explicit job description includes "owns the support-to-eval pipeline" and who reviews the classifier output weekly. They have a Slack channel where the classifier posts new eval-candidate tickets in real time, so the work is incremental rather than quarterly. They have a retirement cron that prunes the eval set monthly. And they have a dashboard that shows, for each failure-mode tag, how many tickets arrive per week and how many regression-suite cases cover it; when those numbers diverge, that's the signal that a category is going underexamined.
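The divergence check behind that dashboard is small. A sketch, with a deliberately crude flagging heuristic you would want to tune:

```python
from collections import Counter

def divergence_report(ticket_tags_this_week, suite_case_tags):
    """Per failure-mode tag: ticket volume vs. regression coverage."""
    tickets = Counter(ticket_tags_this_week)
    cases = Counter(suite_case_tags)
    for tag in sorted(set(tickets) | set(cases)):
        t, c = tickets[tag], cases[tag]
        # Crude heuristic: more tickets than cases means thin coverage.
        flag = "  <-- underexamined" if t > 0 and c < t else ""
        print(f"{tag:24} {t:4d} tickets/wk {c:5d} cases{flag}")
```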
The teams that don't have these habits are the same teams that, six months from now, will discover their eval suite still grades the failure modes that mattered when it was written and ignores the ones that have emerged since. They will catch the regression in a ticket. They will not have a process to turn that ticket into a case. They will, once a quarter, schedule an afternoon to do it manually. And the loop will not close.
The support ticket is the most expensive eval case you will ever produce. Somebody got a bad experience, escalated it, and wrote it up for you. Throwing that away is the most expensive thing an AI team does, and almost every team does it. The pipeline is unglamorous, cross-team, and slightly tedious to operate. It is also the single highest-leverage piece of evaluation infrastructure you can stand up this quarter.
- https://hamel.dev/blog/posts/evals-faq/
- https://www.sh-reya.com/blog/ai-engineering-flywheel/
- https://arxiv.org/html/2510.27051
- https://martinfowler.com/articles/reduce-friction-ai/feedback-flywheel.html
- https://www.usefini.com/blog/why-salesforce-s-ai-fails-65-of-cx-tasks-and-why-b2c-cx-leaders-are-re-thinking-ai-support
- https://github.com/microsoft/ai-agent-eval-scenario-library/blob/main/business-problem-scenarios/triage-and-routing.md
- https://blog.bytebytego.com/p/a-guide-to-llm-evals
- https://newsletter.pragmaticengineer.com/p/evals
- https://developers.openai.com/cookbook/examples/evaluation/building_resilient_prompts_using_an_evaluation_flywheel
- https://www.infoq.com/news/2026/03/doordash-llm-chatbot-simulator/
