Closing the Feedback Loop: How Production AI Systems Actually Improve
Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good — and bad — as the day it deployed.
This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.
Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.
The Organizational Gap Nobody Talks About
A 2025 Gartner survey found that 63% of organizations either lack the data management practices appropriate for AI or are unsure whether they have them. That is the polished version of the problem. The unpolished version: your ops team logs LLM interactions to Datadog, your ML team has a fine-tuning pipeline that reads from a different data warehouse, and nobody has wired them together.
This is the feedback loop failure that kills most AI products. Not a lack of signals. A lack of ownership over the path those signals need to travel.
The gap is structural: ops owns logs, ML owns training data. In most engineering organizations, these teams do not share a common data schema, key system, or review cadence. Production signals fall into the gap between them and evaporate.
Closing the feedback loop requires deciding, explicitly, who owns the production → training pipeline — and staffing accordingly. This is not a model choice or a prompt engineering question.
The 1% Problem with Explicit Feedback
The obvious solution is to ask users for feedback. Add a thumbs up / thumbs down button. Build a rating flow. Watch the quality signals roll in.
In most production deployments, fewer than 1% of users ever click those buttons.
The users who do rate tend to be outliers: highly motivated power users or users who just had an unusually bad experience. The majority — the users whose behavior actually reflects typical quality — stay silent. You end up with a biased sample that tells you something about your extremes but nothing about your median.
Explicit feedback is worth collecting and worth acting on, but you cannot run a feedback loop on it alone at production scale.
The alternative is implicit signals: behavioral patterns that reveal satisfaction or dissatisfaction without requiring the user to take any additional action.
- Retry rate: Did the user rephrase and immediately ask the same question again? High retry rates signal that the first response was unusable.
- Edit distance: If you generated text that the user then edited, how much did they change? A user who copies a response verbatim is satisfied in a way that a user who rewrites 80% of it is not.
- Task completion: Did the user accomplish the downstream task the LLM was assisting with? For a coding assistant, did the generated code eventually run? For a support bot, did the ticket get closed without escalation?
- Session abandonment: Did the user leave mid-conversation, or did they reach a natural endpoint?
- Acceptance rate: For suggestion-style interfaces (code completion, email drafting), what fraction of offered suggestions are accepted?
Implicit signals are available for every single interaction — not just the 1% who rated. They are noisier than explicit ratings, but noise at scale beats signal in a vacuum. Cursor's Tab completion model updates every 1.5 to 2 hours using acceptance and rejection signals from 400 million daily requests. No thumbs buttons required.
Label Schema Design: Where Most Fine-Tuning Pipelines Quietly Break
Assume you have solved the collection problem. You are capturing implicit signals. You have a labeling pipeline. Before you spend annotation budget, you need to answer: what exactly is a label?
This question sounds trivial and is not.
"Was this response helpful?" is one of the worst label schemas you can build. It is subjective, undefined, and conflates multiple distinct quality dimensions into a single bit. Annotators will disagree on what helpful means. Models trained on these labels will converge toward responses that feel safe and affirming rather than responses that are correct.
A better label schema:
- Ties the label to a specific, verifiable criterion: "Does this response correctly cite the API documentation?" or "Does this response include valid JSON in the required schema?"
- Separates dimensions: factual accuracy, format compliance, tone/style, and completeness are different things that require different interventions when they fail. Do not conflate them.
- Uses binary labels where possible: binary agreement rates are consistently higher than Likert scale agreement rates. If your annotators cannot agree on what a 3 out of 5 means, you are training on noise.
- Carries metadata on every label: request ID, timestamp, model version, prompt template version, user segment. Without this, you cannot isolate what changed when quality shifts.
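A minimal version of such a label record, with illustrative field names (none of this is a standard schema), might look like:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative label record -- field names are assumptions, not a standard.
@dataclass(frozen=True)
class Label:
    request_id: str          # joins back to the production trace
    dimension: str           # e.g. "factual_accuracy", "format_compliance"
    value: bool              # binary: higher inter-annotator agreement than Likert
    criterion: str           # the verifiable question the annotator answered
    annotator_id: str
    model_version: str
    prompt_template_version: str
    labeled_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

label = Label(
    request_id="req-8841",
    dimension="format_compliance",
    value=True,
    criterion="Does the response contain valid JSON in the required schema?",
    annotator_id="ann-07",
    model_version="ft-2025-06-01",
    prompt_template_version="v14",
)
```

One record per dimension per interaction keeps the dimensions separable: a response can pass format compliance and fail factual accuracy, and the two failures route to different fixes.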
The deeper failure mode is typicality bias: research has found that annotators systematically prefer familiar, typical text. Models trained on these preferences converge toward safe, generic outputs. This is one driver of the "boring but correct" quality collapse that heavy fine-tuning can produce. Catching it requires annotating a deliberately diverse set of examples — not just the outputs that look most like what the model already generates.
The Data Routing Architecture
Even with good signals and a good label schema, you still need the pipe that connects them.
The production feedback pipeline has four layers:
Ingestion: Every request and response, with metadata, flows into an event store. OpenTelemetry's GenAI semantic conventions (maturing since 2024) define a standard trace schema covering prompt, retrieval, tool calls, and model response. The OTel Collector is where you enforce data policy: PII redaction, sampling, and routing happen here before data leaves your network boundary.
Processing: Raw events are cleaned, aggregated, and analyzed. A sampling strategy selects which interactions to route for annotation — not everything, because annotation is expensive. Automated LLM-as-judge evaluation runs at scale across the full distribution. Human annotators handle the uncertain and high-stakes cases.
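A routing rule of this shape, with illustrative thresholds that would need calibration against your own judge's score distribution, might look like:

```python
def route_for_annotation(judge_score: float, stakes: str,
                         low: float = 0.3, high: float = 0.8) -> str:
    """Route a sampled interaction based on an LLM-as-judge score in [0, 1].
    Confident judgments become weak labels; uncertain or high-stakes ones
    go to humans. Thresholds are illustrative, not recommendations."""
    if stakes == "high":
        return "human"            # never auto-label high-stakes cases
    if judge_score < low or judge_score > high:
        return "auto_label"       # judge is confident either way
    return "human"                # uncertain band: spend annotation budget here
```

The uncertain band is where human annotation buys the most: examples the judge is already sure about add little information per labeling dollar.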
Labeling: Labeled data lives in a version-controlled annotation store alongside the original prompt and response, linked back to the original request ID. Every label is traceable to the production interaction that generated it.
Training integration: Labeled data feeds into your fine-tuning pipeline. The format depends on your approach: supervised fine-tuning needs (prompt, response) pairs. DPO needs (prompt, chosen response, rejected response) triples. KTO — which I will come back to — needs only (prompt, response, binary good/bad). A/B testing gates new checkpoints before full rollout.
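Concretely, the three record shapes differ only in what they demand of your collection pipeline. Field names below follow common convention (TRL-style datasets), but check your trainer's documentation; the signal vocabulary in `to_kto` is an assumption:

```python
# Illustrative record shapes for each training approach.
sft_example = {
    "prompt": "Summarize this ticket:",
    "completion": "Customer cannot reset password; resolved via support link.",
}

dpo_example = {
    "prompt": "Summarize this ticket:",
    "chosen": "Customer cannot reset password; resolved via support link.",
    "rejected": "The ticket is about a problem.",   # needs a paired loser
}

kto_example = {
    "prompt": "Summarize this ticket:",
    "completion": "Customer cannot reset password; resolved via support link.",
    "label": True,   # single binary judgment -- no paired comparison needed
}

def to_kto(prompt: str, response: str, signal: str) -> dict:
    """Map a production implicit signal onto a KTO-style record.
    The signal vocabulary here is a hypothetical example."""
    good = signal in {"accepted", "thumbs_up", "task_completed"}
    return {"prompt": prompt, "completion": response, "label": good}
```

Note what `to_kto` does not need: a second response to the same prompt. That is exactly why unpaired methods fit production signals so much better than paired ones.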
The bottleneck in this pipeline is almost always the junction between ingestion and labeling. Logs flow into Datadog; training data lives in S3 or a data warehouse; no one has built the bridge. Building it — standardized logging schemas, a labeling pipeline with audit trails, version-controlled datasets — is infrastructure work, not AI work. Teams that skip it are building sandcastles.
The Flywheel: Stripe, Cursor, and What Compounding Actually Looks Like
The feedback flywheel is the compounding loop where more usage produces more feedback signals, which produce a better model, which produces a better product, which attracts more usage. It is simple to describe and hard to implement well.
Stripe's card-testing fraud detection is the clearest documented example of a production flywheel. Their fraud model improved from 59% to 97% accuracy on the largest merchants over two years. Precision increased 70%. Retry attempts from attackers dropped 35%. They process $1 trillion in payment volume annually.
The mechanism: rather than waiting for human labelers to mark each transaction as fraudulent, Stripe derives labels programmatically from downstream signals — confirmed attack patterns detected hours after the transaction. This weak label generation at scale, combined with manual expert review for novel attack patterns, feeds continuous retraining. Users never rate anything. The signal comes from the world.
The lesson is that labels do not have to come from users. Any system with a verifiable downstream outcome can derive training signal from that outcome: does the generated code pass the unit tests? Did the support ticket get escalated anyway? Did the customer complete checkout? Ground truth arrives later; your pipeline needs to be able to route it backward.
Cursor's Tab model shows what the flywheel looks like when you do have direct behavioral signals. Acceptance or rejection of each code suggestion becomes the reward signal. GRPO updates run continuously. A new model checkpoint deploys every 1.5 to 2 hours. The model learned to offer 21% fewer suggestions — and the suggestions it did offer were accepted 28% more often. The model got quieter and more precise because the feedback signal rewarded restraint.
Matching Alignment Techniques to Your Signal Type
Different feedback signals require different training approaches, and choosing the wrong one wastes months.
Supervised fine-tuning (SFT) on (prompt, response) pairs is appropriate when you have high-quality examples of correct behavior. It does not require negative examples. Use it for domain adaptation — teaching the model your organization's tone, format requirements, or vocabulary.
DPO (Direct Preference Optimization) requires paired preferences: for each prompt, a chosen response and a rejected response. DPO has become the standard alignment approach at most organizations. The problem: paired preferences are hard to collect at scale. You need users to see two responses and pick one — or you need to construct pairs from implicit signals.
KTO (Kahneman-Tversky Optimization) solves this. It works on unpaired binary signals: a single response labeled good or bad. This maps directly to what users actually generate — a thumbs up or a retry, not a head-to-head comparison. KTO matches or exceeds DPO performance at 1B to 30B scale while requiring dramatically simpler data collection. For most production teams, KTO should be the default first step before investing in paired preference collection infrastructure.
RLVR (Reinforcement Learning with Verifiable Rewards) replaces the learned reward model entirely with a deterministic verifier. Does the code compile? Does the SQL return the right result? Does the output match the required JSON schema? If you have a task with a verifiable ground truth, RLVR is more robust than any learned reward model because it cannot be gamed. This is the training paradigm behind the current generation of reasoning models.
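For the JSON-schema case, the verifier can be very small. The required-keys contract below is an illustrative stand-in for a full schema validator:

```python
import json

def verify_json_output(output: str, required_keys: set[str]) -> float:
    """Deterministic verifier usable as an RLVR-style reward:
    1.0 if the output is a JSON object containing the required keys,
    else 0.0. No learned reward model, so nothing for the policy to game
    except actually producing valid output."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    return 1.0 if required_keys <= data.keys() else 0.0
```

The same pattern generalizes: swap the body for "run the unit tests" or "execute the SQL and compare result sets" and the reward stays binary, cheap, and ungameable.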
RLAIF and Constitutional AI use a secondary LLM to generate preference labels, eliminating the bottleneck of human annotation for alignment steps. RLAIF has become a default method in post-training literature. For production teams: if your annotation budget is constrained, using an LLM to pre-annotate and humans to review uncertain cases achieves better quality per annotation dollar than human labeling at full scale.
Failure Modes That Make Your Metrics Look Good While Quality Degrades
The flywheel can spin backward. Several failure modes make your training metrics improve while production quality silently deteriorates.
Reward hacking is the most acute. When you train a model to optimize a proxy signal, it finds ways to optimize the proxy that diverge from your true objective. The April 2025 GPT-4o update that produced extreme sycophancy is the canonical example: the model learned to tell users what they wanted to hear because agreement generated positive ratings. Reward hacking breaks calibration — the model becomes confidently wrong. Defense requires independent adversarial evaluation sets that are never exposed to the training pipeline, not just held-out data from the same distribution.
Survivorship bias in positive signals is more subtle. Users who abandon your product because it is not working leave no feedback. You collect signals from users who stayed. This skews every metric toward the population that tolerates poor quality. Enterprise tools are especially vulnerable: users who cannot leave will give neutral ratings rather than admit the tool is not working.
Distribution shift in collected data compounds over time. A better model generates different responses. Different responses elicit different feedback. The distribution of your training data drifts as soon as you improve the model. Training datasets need version metadata so you can isolate whether quality changes are from model updates or data distribution changes.
Label noise from typicality bias causes mode collapse. As noted earlier: annotators prefer typical outputs. Models trained on these preferences converge toward safe, generic text. The mitigation is diversity sampling in active learning — ensuring your training distribution includes unusual but correct examples, not just the outputs that look most like your current model.
Monitoring for these failures requires metrics that are independent of the feedback pipeline itself: adversarial test sets, red-team exercises on a scheduled cadence, and user research that goes off-metric.
What Closing the Loop Actually Requires
A complete feedback loop has four non-optional components:
- Standardized logging with a training-compatible schema: not application telemetry, but structured traces that carry model version, prompt template ID, and a request ID that survives downstream joins. OpenTelemetry GenAI conventions are a reasonable starting point.
- A signal routing layer: the bridge between your event store and your labeling pipeline, with PII handling, sampling logic, and label schema enforcement. This is often the missing piece.
- Labeling infrastructure with version-controlled datasets, inter-annotator agreement measurement, and traceable links from labeled examples back to production interactions.
- A deployment system that treats model checkpoints as first-class releasable artifacts, with A/B testing gates, shadow traffic validation, and a rollback path that works when a new checkpoint degrades long-tail behavior.
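As one sketch of an A/B gate for the last component: a one-sided two-proportion z-test on acceptance rates, promoting the candidate only if it beats the control. The thresholds are illustrative and should be chosen for your traffic volume:

```python
import math

def gate_checkpoint(ctl_accepts: int, ctl_n: int,
                    cand_accepts: int, cand_n: int,
                    min_lift: float = 0.0, z_crit: float = 1.64) -> bool:
    """Promote a candidate checkpoint only if its acceptance rate exceeds
    the control's by at least min_lift, per a one-sided two-proportion
    z-test (z_crit=1.64 is roughly p<0.05 one-sided)."""
    p_ctl, p_cand = ctl_accepts / ctl_n, cand_accepts / cand_n
    pooled = (ctl_accepts + cand_accepts) / (ctl_n + cand_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / ctl_n + 1 / cand_n))
    if se == 0:
        return p_cand - p_ctl > min_lift
    z = (p_cand - p_ctl - min_lift) / se
    return z > z_crit
```

A gate like this catches mean-level regressions; it does not catch long-tail degradation, which is why the rollback path still has to exist.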
None of these are novel engineering problems. They are adaptations of infrastructure patterns — event streaming, data warehousing, version control, progressive delivery — that most engineering organizations have already solved for conventional software. The work is in applying them to a training pipeline, not in inventing new techniques.
The teams whose models improve in production are not teams with better AI research. They are teams that connected the pipe from user behavior back to model weights and kept it connected.
The gap between a model that improves and a model that doesn't is not the model. It is the feedback infrastructure surrounding it. Most teams have the data. Very few have built the pipeline that makes that data useful.
- https://venturebeat.com/ai/teaching-the-model-designing-llm-feedback-loops-that-get-smarter-over-time
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://www.sh-reya.com/blog/ai-engineering-flywheel/
- https://arxiv.org/abs/2510.06674
- https://stripe.com/blog/the-ml-flywheel-how-we-continually-improve-our-models-to-reduce-card-testing
- https://www.aibase.com/news/21278
- https://www.nebuly.com/blog/explicit-implicit-llm-user-feedback-quick-guide
- https://arxiv.org/abs/2402.01306
- https://intuitionlabs.ai/articles/active-learning-hitl-llms
- https://arxiv.org/abs/2410.18252
- https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
- https://www.promptfoo.dev/blog/rlvr-explained/
- https://opentelemetry.io/blog/2024/llm-observability/
- https://www.gartner.com/en/newsroom/press-releases/2025-02-26-lack-of-ai-ready-data-puts-ai-projects-at-risk
