Closing the Feedback Loop: How Production AI Systems Actually Improve

· 12 min read
Tian Pan
Software Engineer

Your AI product shipped three months ago. You have dashboards showing latency, error rates, and token costs. You've seen users interact with the system thousands of times. And yet your model is exactly as good — and bad — as the day it deployed.

This is not a data problem. You have more data than you know what to do with. It is an architecture problem. The signals that tell you where your model fails are sitting in application logs, user sessions, and downstream outcome data. They are disconnected from anything that could change the model's behavior.

Most teams treat their LLM as a static artifact and wrap monitoring and evaluation around the outside. The best teams treat production as a training pipeline that never stops.

The Organizational Gap Nobody Talks About

A 2025 Gartner survey found that 63% of organizations either lack the data management practices appropriate for AI or are unsure whether they have them. That is the polished version of the problem. The unpolished version: your ops team logs LLM interactions to Datadog, your ML team has a fine-tuning pipeline that reads from a different data warehouse, and nobody has wired the two together.

This is the feedback loop failure that kills most AI products. Not a lack of signals. A lack of ownership over the path those signals need to travel.

The gap is structural: ops owns logs, ML owns training data. In most engineering organizations, these teams do not share a common data schema, a key system for joining records, or a review cadence. Production signals fall into the gap between them and evaporate.

Closing the feedback loop requires deciding, explicitly, who owns the production → training pipeline — and staffing accordingly. This is not a model choice or a prompt engineering question.

The 1% Problem with Explicit Feedback

The obvious solution is to ask users for feedback. Add a thumbs up / thumbs down button. Build a rating flow. Watch the quality signals roll in.

Less than 1% of users in most production deployments ever click those buttons.

The users who do rate tend to be outliers: highly motivated power users or users who just had an unusually bad experience. The majority — the users whose behavior actually reflects typical quality — stay silent. You end up with a biased sample that tells you something about your extremes but nothing about your median.

Explicit feedback is worth collecting and worth acting on, but you cannot run a feedback loop on it alone at production scale.

The alternative is implicit signals: behavioral patterns that reveal satisfaction or dissatisfaction without requiring the user to take any additional action.

  • Retry rate: Did the user rephrase and immediately ask the same question again? High retry rates signal that the first response was unusable.
  • Edit distance: If you generated text that the user then edited, how much did they change? A user who copies a response verbatim is satisfied in a way that a user who rewrites 80% of it is not.
  • Task completion: Did the user accomplish the downstream task the LLM was assisting with? For a coding assistant, did the generated code eventually run? For a support bot, did the ticket get closed without escalation?
  • Session abandonment: Did the user leave mid-conversation, or did they reach a natural endpoint?
  • Acceptance rate: For suggestion-style interfaces (code completion, email drafting), what fraction of offered suggestions are accepted?

Implicit signals are available for every single interaction — not just the 1% who rated. They are noisier than explicit ratings, but noise at scale beats signal in a vacuum. Cursor's Tab completion model updates every 1.5 to 2 hours using acceptance and rejection signals from 400 million daily requests. No thumbs buttons required.
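To make this concrete, here is a minimal sketch of how a few of these signals might be computed from session events. The `Turn` fields and the retry-similarity threshold are illustrative assumptions, not a prescribed schema — adapt them to whatever your event store actually records:

```python
# Sketch: deriving implicit quality signals from session events.
# All field names and thresholds here are hypothetical.
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Turn:
    user_text: str
    model_text: str
    final_text: str | None = None   # what the user actually kept, if observable
    accepted: bool | None = None    # for suggestion-style UIs

def is_retry(prev: Turn, curr: Turn, threshold: float = 0.8) -> bool:
    """Treat a near-duplicate follow-up question as a retry of the previous turn."""
    similarity = SequenceMatcher(None, prev.user_text, curr.user_text).ratio()
    return similarity >= threshold

def edit_ratio(turn: Turn) -> float | None:
    """Fraction of the generated text the user changed (0.0 = kept verbatim)."""
    if turn.final_text is None:
        return None
    return 1.0 - SequenceMatcher(None, turn.model_text, turn.final_text).ratio()

def session_signals(turns: list[Turn]) -> dict:
    """Aggregate per-session signals: retry rate, edit ratio, acceptance rate."""
    retries = sum(is_retry(a, b) for a, b in zip(turns, turns[1:]))
    edits = [r for t in turns if (r := edit_ratio(t)) is not None]
    rated = [t.accepted for t in turns if t.accepted is not None]
    return {
        "retry_rate": retries / max(len(turns) - 1, 1),
        "mean_edit_ratio": sum(edits) / len(edits) if edits else None,
        "acceptance_rate": sum(rated) / len(rated) if rated else None,
    }
```

None of these computations requires the user to do anything extra — which is exactly why they cover every interaction instead of the rating minority.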

Label Schema Design: Where Most Fine-Tuning Pipelines Quietly Break

Assume you have solved the collection problem. You are capturing implicit signals. You have a labeling pipeline. Before you spend annotation budget, you need to answer: what exactly is a label?

This question sounds trivial and is not.

"Was this response helpful?" is one of the worst label schemas you can build. It is subjective, undefined, and conflates multiple distinct quality dimensions into a single bit. Annotators will disagree on what helpful means. Models trained on these labels will converge toward responses that feel safe and affirming rather than responses that are correct.

A better label schema:

  • Ties the label to a specific, verifiable criterion: "Does this response correctly cite the API documentation?" or "Does this response include valid JSON in the required schema?"
  • Separates dimensions: factual accuracy, format compliance, tone/style, and completeness are different things that require different interventions when they fail. Do not conflate them.
  • Uses binary labels where possible: binary agreement rates are consistently higher than Likert scale agreement rates. If your annotators cannot agree on what a 3 out of 5 means, you are training on noise.
  • Carries metadata on every label: request ID, timestamp, model version, prompt template version, user segment. Without this, you cannot isolate what changed when quality shifts.
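A sketch of a label record that follows these principles — field names are illustrative, and the point is the shape: binary, per-dimension labels plus enough metadata to reconstruct the serving context later:

```python
# Sketch of a label record. All field names are hypothetical examples.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class QualityLabel:
    # Provenance: without these you cannot isolate what changed when quality shifts.
    request_id: str
    timestamp: datetime
    model_version: str
    prompt_template_version: str
    user_segment: str
    annotator_id: str

    # Separate, binary, verifiable dimensions — never a single "helpful?" bit.
    cites_api_docs_correctly: bool | None   # None = dimension not applicable
    output_is_valid_json: bool | None
    matches_required_schema: bool | None
    tone_matches_style_guide: bool | None
```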

The deeper failure mode is typicality bias: research has found that annotators systematically prefer familiar, typical text. Models trained on these preferences converge toward safe, generic outputs. This is one driver of the "boring but correct" quality collapse that heavy fine-tuning can produce. Catching it requires annotating a deliberately diverse set of examples — not just the outputs that look most like what the model already generates.

The Data Routing Architecture

Even with good signals and a good label schema, you still need the pipe that connects them.

The production feedback pipeline has four layers:

Ingestion: Every request and response, with metadata, flows into an event store. OpenTelemetry's GenAI semantic conventions (maturing since 2024) define a standard trace schema covering prompt, retrieval, tool calls, and model response. The OTel Collector is where you enforce data policy: PII redaction, sampling, and routing happen here before data leaves your network boundary.
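As a sketch of what ingestion instrumentation looks like with the OpenTelemetry Python SDK — the `gen_ai.*` attribute names follow the GenAI semantic conventions, which are still maturing, so pin the semconv version you adopt; the `client` here is assumed to be an OpenAI-style SDK client:

```python
# Sketch: emitting a GenAI-conformant span around an LLM call.
# Prompt/response content is deliberately not attached here; content capture
# and PII redaction belong in the OTel Collector's processing pipeline.
from opentelemetry import trace

tracer = trace.get_tracer("app.llm")

def traced_chat(client, model: str, messages: list[dict]):
    # Span name follows the "{operation} {model}" GenAI convention.
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", model)
        response = client.chat.completions.create(model=model, messages=messages)
        span.set_attribute("gen_ai.response.model", response.model)
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
        return response
```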

Processing: Raw events are cleaned, aggregated, and analyzed. A sampling strategy selects which interactions to route for annotation — not everything, because annotation is expensive. Automated LLM-as-judge evaluation runs at scale across the full distribution. Human annotators handle the uncertain and high-stakes cases.
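One way to express that routing decision in code — the thresholds, segment names, and audit rate below are placeholder assumptions to tune against your own traffic, not recommended values:

```python
# Sketch of an annotation router: judge everything cheaply, send only the
# uncertain or high-stakes slice to humans. All constants are hypothetical.
import random

HIGH_STAKES_SEGMENTS = {"enterprise", "regulated"}

def route(event: dict, judge_score: float) -> str:
    """Return the annotation path for one production interaction."""
    if event.get("user_segment") in HIGH_STAKES_SEGMENTS:
        return "human"      # high-stakes traffic: always human review
    if 0.3 < judge_score < 0.7:
        return "human"      # LLM-as-judge is uncertain: escalate
    if random.random() < 0.01:
        return "human"      # random audit of "easy" cases keeps the judge honest
    return "auto"           # trust the LLM-as-judge label
```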
