Your Annotation Pipeline Is the Real Bottleneck in Your AI Product
Every team working on an AI product eventually ships a feedback widget. Thumbs up. Thumbs down. Maybe a star rating or a correction field. The widget launches. The data flows. And then nothing changes about the model — for weeks, then months — while the team remains genuinely convinced they have a working feedback loop.
The widget was the easy part. The annotation pipeline behind it is where AI products actually stall.
This is not a tooling problem or a resourcing problem in the usual sense. It is an infrastructure sequencing problem. Teams build the data capture layer first because it is visible and demonstrable. The downstream machinery — schema versioning, inter-annotator agreement tracking, routing, queue management, retraining triggers — gets deprioritized sprint after sprint because it is invisible until it breaks. By the time a team realizes feedback is expiring unused in a database, weeks or months of production signal have become irrelevant.
The Gap Between Capture and Improvement
Consider a typical trajectory. A team ships an LLM feature. They add a thumbs-up/thumbs-down widget and instrument it carefully. Within a week they have thousands of negative signals. Six weeks later, the model is unchanged.
What happened in between? Usually something like: a data scientist exported the feedback to a spreadsheet, noticed the label definitions were ambiguous (does thumbs-down mean "factually wrong" or "not what I wanted"?), opened a discussion in Slack, got pulled into other work, and the thread went cold. The data sat. New feedback accumulated on top of old feedback. Eventually someone proposed a labeling schema, but the first schema was incompatible with how the original feedback was stored, so historical data could not be used. The cycle restarted.
This pattern has a measurable cost. One widely cited study found that 91% of ML models degrade in performance over time, and stale annotations (those collected under a different data distribution or against an earlier label schema) can become actively harmful to model quality within three to six months. Feedback that is not routed, reviewed, and incorporated within that window is not just wasted; it can degrade the models it was meant to improve.
Why Label Schema Versioning Breaks Everything Downstream
The annotation schema is the contract between human feedback and model training. When it changes — and in fast-moving products, it changes frequently — every downstream system that depended on the old schema must be reconciled.
The problem is that schema changes are almost never tracked with the rigor of code changes. A label goes from binary ("helpful / not helpful") to a four-point scale. A new intent category gets added mid-quarter. An edge case gets reclassified. Each of these changes creates a fork in the annotation history: data collected before the change is not directly comparable to data collected after.
Teams that manage this well treat label schemas exactly like code. They use version identifiers, maintain migration scripts that can reinterpret historical annotations under new definitions, and enforce schema compatibility checks before any new annotation batch is opened. Platforms like Databricks MLflow 3 now provide Git-like version control for labeling schemas precisely because this need is ubiquitous. Snorkel's approach to programmatic labeling — encoding label definitions as executable functions rather than natural language guidelines — makes retroactive relabeling tractable: update the labeling function, rerun it over the historical corpus, and you have a reconciled dataset without human reannotation.
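To make the programmatic-relabeling idea concrete, here is a minimal Python sketch under assumed conventions: the `FeedbackRecord` fields, the `relabel_v2` labeling function, and the schema version tag are all hypothetical stand-ins for whatever your capture layer actually stores, not any specific platform's API.

```python
from dataclasses import dataclass
from typing import Optional

SCHEMA_VERSION = "v2"  # bump whenever label definitions change

@dataclass
class FeedbackRecord:
    """Raw signal captured by the feedback widget (hypothetical fields)."""
    thumbs_up: bool
    comment: Optional[str]
    product_version: str

def relabel_v2(record: FeedbackRecord) -> str:
    """Labeling function encoding the v2 schema as executable logic.

    v1 stored a binary helpful / not-helpful flag. v2 splits negative
    feedback into 'factually_wrong' vs 'off_intent' based on the comment,
    so the historical corpus can be relabeled by rerunning this function
    instead of asking humans to reannotate.
    """
    if record.thumbs_up:
        return "helpful"
    text = (record.comment or "").lower()
    if any(kw in text for kw in ("wrong", "incorrect", "false")):
        return "factually_wrong"
    return "off_intent"

def migrate(records: list[FeedbackRecord]) -> list[dict]:
    """Reconcile historical data: rerun the current labeling function
    over every stored record and tag the output with the schema version."""
    return [
        {"label": relabel_v2(r), "schema_version": SCHEMA_VERSION,
         "product_version": r.product_version}
        for r in records
    ]
```

Because the label definition lives in code, changing it and rerunning `migrate` over the stored corpus reconciles old and new data in one pass.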
The teams that do not do this find themselves unable to combine feedback collected across different product versions, effectively resetting their training data every time the product evolves.
Inter-Annotator Agreement Is Not a One-Time Check
Most annotation workflows compute inter-annotator agreement (IAA) once, during the initial setup phase, as a sanity check. This is the wrong cadence.
IAA measures the consistency of human labels: how often independent annotators assign the same label to the same input. Cohen's Kappa corrects for chance agreement between two annotators and works well for binary and nominal labels. Krippendorff's Alpha handles multiple annotators, missing labels, and ordinal or interval data, making it more appropriate for subjective quality ratings. Both metrics are now built into annotation platforms like Prodigy and Datasaur.
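For reference, Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance given each annotator's label frequencies. A small sketch, with made-up label lists, computing it from scratch and via scikit-learn's `cohen_kappa_score`:

```python
# Cohen's kappa from first principles and via scikit-learn.
# The two annotator label lists are illustrative, not real data.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

annotator_a = ["helpful", "off_intent", "helpful", "factually_wrong", "helpful"]
annotator_b = ["helpful", "off_intent", "off_intent", "factually_wrong", "helpful"]

def cohens_kappa(a: list[str], b: list[str]) -> float:
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    freq_a, freq_b = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(annotator_a, annotator_b))       # ~0.69
print(cohen_kappa_score(annotator_a, annotator_b))  # same value from sklearn
```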
The problem is not computing IAA — it is computing it continuously. As an annotation queue grows, new annotators join, guidelines drift through informal interpretation, and the distribution of inputs changes. A team that measured 0.82 Kappa in month one may be operating at 0.58 Kappa by month four without knowing it. The symptoms of this degradation are not annotation errors — they look like model instability, unexplained regressions, and training runs that fail to converge on obvious improvements.
The remedy is weekly IAA checks with a shared calibration set: a fixed sample of inputs that all annotators label independently each week. Disagreements trigger a synchronization session between annotators and the engineering team. This sounds expensive. It is far cheaper than debugging a model that was trained on silently degraded labels.
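A minimal sketch of that weekly check, assuming each annotator's calibration-set labels are already available as ordered lists keyed by annotator name; the 0.7 alert threshold and the print-based alert are placeholder choices, not a standard.

```python
from itertools import combinations
from statistics import mean
from sklearn.metrics import cohen_kappa_score

KAPPA_ALERT_THRESHOLD = 0.7  # illustrative; tune to the subjectivity of your task

def weekly_iaa_check(calibration_labels: dict[str, list[str]]) -> float:
    """Compute mean pairwise Cohen's kappa over the shared calibration set.

    `calibration_labels` maps annotator name -> labels for the same fixed
    inputs, in the same order. Flags drift when agreement falls below the
    threshold.
    """
    pairwise = [
        cohen_kappa_score(calibration_labels[a], calibration_labels[b])
        for a, b in combinations(calibration_labels, 2)
    ]
    score = mean(pairwise)
    if score < KAPPA_ALERT_THRESHOLD:
        # In practice: open a ticket or post to the team channel and schedule
        # a calibration session before the next annotation batch opens.
        print(f"IAA drift: mean pairwise kappa {score:.2f} below threshold")
    return score

# Example with three annotators labeling the same five calibration inputs.
print(weekly_iaa_check({
    "ana":   ["helpful", "off_intent", "helpful", "factually_wrong", "helpful"],
    "ben":   ["helpful", "off_intent", "off_intent", "factually_wrong", "helpful"],
    "chloe": ["helpful", "helpful", "off_intent", "factually_wrong", "helpful"],
}))
```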
The Routing Problem Nobody Talks About
Even when feedback is captured and labels are consistent, the data often does not reach the people who can act on it. This is the routing problem, and it compounds the schema and IAA problems.
In a large product, feedback touches multiple model components. A single user interaction might involve a retrieval step, a summarization step, and a formatting step, each owned by a different team. A thumbs-down on the final output could signal failure at any of these layers. Without routing — logic that maps feedback signals to the model components responsible for them — feedback lands in a general queue that nobody feels specifically responsible for.
Effective routing requires two things: a taxonomy of failure modes that maps to ownership, and automated triage that classifies incoming feedback against that taxonomy. This is not a human process at scale. Teams running mature annotation pipelines use classifier models to route feedback: a lightweight model trained on past labeled examples that learns to distinguish "this was a retrieval failure" from "this was a generation failure" from "this was a formatting failure." Each category routes to the appropriate model team's annotation queue.
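A hedged sketch of that triage step, assuming a three-way failure taxonomy and a small set of hand-triaged past feedback; the categories, queue names, and the TF-IDF-plus-logistic-regression choice are illustrative, and a production router would use more signal than the comment text alone.

```python
# Lightweight feedback-routing classifier: free-text feedback -> owning team's queue.
# Training examples and queue names are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Past feedback, manually triaged against the failure-mode taxonomy.
texts = [
    "it cited a document that doesn't mention this at all",
    "the answer ignores the attached contract entirely",
    "summary contradicts itself halfway through",
    "the response just makes up a policy we don't have",
    "the table is mangled and the bullets are broken",
    "markdown headers show up as literal hash signs",
]
labels = [
    "retrieval_failure", "retrieval_failure",
    "generation_failure", "generation_failure",
    "formatting_failure", "formatting_failure",
]

QUEUES = {
    "retrieval_failure": "search-team-annotation-queue",
    "generation_failure": "llm-team-annotation-queue",
    "formatting_failure": "ux-team-annotation-queue",
}

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(texts, labels)

def route(feedback_text: str) -> str:
    """Classify a new piece of feedback and return the queue that owns it."""
    predicted = router.predict([feedback_text])[0]
    return QUEUES[predicted]

print(route("it pulled the wrong document from the knowledge base"))
```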
This sounds like a lot of infrastructure to build before the underlying model is good. That is exactly the point. Teams that defer this infrastructure spend their time on the wrong improvements — making the generator better when the problem is the retriever, or vice versa.
Sources
- https://labelyourdata.com/articles/active-learning-machine-learning
- https://mindkosh.com/blog/scaling-annotation-pipelines-without-breaking-label-consistency/
- https://medium.com/glovo-engineering/common-but-overlooked-causes-of-ml-system-failures-part-i-degenerate-feedback-loops-9ee153eae620
- https://snorkel.ai/data-labeling/
- https://huggingface.co/blog/rlhf
- https://encord.com/blog/active-learning-machine-learning-guide/
- https://datasaur.ai/blog-posts/inter-annotator-agreement-krippendorff-cohen
- https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/concepts/labeling-schemas
- https://labelyourdata.com/articles/machine-learning/data-versioning
- https://huyenchip.com/2023/05/02/rlhf.html
- https://neptune.ai/blog/retraining-model-during-deployment-continuous-training-continuous-testing
- https://humansintheloop.org/what-is-model-collapse-and-why-its-a-2025-concern/
