The Feedback Loop You Never Closed: Turning User Behavior into AI Ground Truth
Most teams building AI products spend weeks designing rating widgets: click-to-rate stars, thumbs-up/thumbs-down buttons. Then they look at the data six months later and find a 2% response rate — biased toward outlier experiences, dominated by people with strong opinions, and almost entirely useless for distinguishing a 7/10 output from a 9/10 one.
Meanwhile, every user session is generating a continuous stream of honest, unambiguous behavioral signals. The user who accepts a code suggestion and moves on is satisfied. The user who presses Ctrl+Z immediately is not. The user who rephrases their question four times in a row is telling you something explicit ratings will never capture: the first three responses failed. These signals exist whether you collect them or not. The question is whether you're closing the loop.
Why Thumbs Ratings Are Structurally Broken
Explicit feedback has a fundamental problem: it captures what users say, not what users do. These two things diverge constantly.
A user might rate an AI writing assistant five stars while editing 80% of every response it generates. The rating captures sentiment — the tool feels useful, the user likes the product. The editing behavior captures quality — the outputs are consistently wrong enough to require manual correction. If you optimize on ratings alone, you'll build a system that feels impressive but produces unreliable work.
The divergence goes the other way too. Users who are quietly satisfied — the ones who accept outputs and keep moving — rarely rate anything. They have no reason to stop and click a star. Your 4.7-star average is built on the voices of people with strong opinions, not the median experience.
Research on multi-turn dialogue systems found that implicit feedback signals appear in over half of later user utterances in a conversation, while explicit ratings are collected for a tiny fraction of interactions. The behavioral data is richer, more frequent, and generated automatically. You're already paying for the infrastructure that produces it.
The Behavioral Signals Worth Collecting
Not all implicit signals are equally informative. Some require context to interpret correctly. Here's a taxonomy of signals ranked roughly by reliability:
High-confidence negative signals:
- Immediate undo (Ctrl+Z within a few seconds of accepting a suggestion)
- Query rephrasing within the same session (the user is retrying because the last response failed)
- Session abandonment without task completion (navigated away mid-workflow)
- Explicit overwrite of the full output rather than targeted editing
High-confidence positive signals:
- Proceeding to the next task without modification or follow-up
- Copying the output to a downstream tool or sharing it externally
- Generating an output and then asking a follow-up that builds on it
Context-dependent signals:
- Response length accepted vs. truncated (if users always scroll and keep reading, the length is appropriate; if they consistently stop reading at line 10, it's not)
- Edit rate on AI-generated content (requires baseline: a 30% edit rate on legal contracts is fine; the same rate on customer emails written by a writing assistant is high)
- Conversation continuation (user responding vs. going silent)
One research finding on dialogue systems: a simple length-based implicit signal (whether a user's reply exceeds a threshold length) produced a 12-point win-rate advantage over baseline models. That's a meaningfully strong signal from a single cheap-to-collect behavioral metric.
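The taxonomy above can be sketched as a weighted signal table. This is a minimal illustration, not a production scoring system: the signal names mirror the lists above, but the polarity and reliability weights are hypothetical values you would tune per product.

```python
from dataclasses import dataclass

# Hypothetical signal taxonomy mirroring the lists above.
# Reliability weights are illustrative, not measured values.
@dataclass(frozen=True)
class Signal:
    name: str
    polarity: int       # +1 positive, -1 negative
    reliability: float  # 0..1: how much one event should move the label

SIGNALS = {
    "immediate_undo":          Signal("immediate_undo", -1, 0.9),
    "query_rephrase":          Signal("query_rephrase", -1, 0.8),
    "session_abandon":         Signal("session_abandon", -1, 0.7),
    "full_overwrite":          Signal("full_overwrite", -1, 0.8),
    "next_task_no_edit":       Signal("next_task_no_edit", +1, 0.8),
    "copy_downstream":         Signal("copy_downstream", +1, 0.9),
    "building_follow_up":      Signal("building_follow_up", +1, 0.7),
    # Context-dependent signals get low standalone reliability.
    "partial_edit":            Signal("partial_edit", -1, 0.3),
    "conversation_continues":  Signal("conversation_continues", +1, 0.3),
}

def score_session(events: list[str]) -> float:
    """Sum signed, reliability-weighted signals observed in one session."""
    return sum(SIGNALS[e].polarity * SIGNALS[e].reliability
               for e in events if e in SIGNALS)
```

A session containing only an immediate undo scores strongly negative; a copy-downstream event followed by a partial edit nets out mildly positive, which is exactly the kind of mixed-signal session worth sampling for review.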
Building the Pipeline From Behavior to Eval Dataset
The goal is to convert behavioral signals into labeled pairs: (input, output, quality_label) that you can use to build eval datasets, fine-tune reward models, or identify regression cases. Here's a practical pipeline.
Step 1: Instrument the interaction layer
Standard observability — latency, error rates, token counts — is not enough. You need interaction-layer observability:
- Log session IDs and turn sequences, not just individual requests
- Record edit events with timestamps relative to output generation
- Capture navigation events (did the user leave the product, go deeper into it, or repeat a step?)
- Record query reformulations (same session, similar semantic intent, new phrasing = retry signal)
You don't need to capture content. You need to capture structure: what happened after the response, how long after, and in what direction.
Step 2: Define behavioral labels per product context
Behavioral signals mean different things in different contexts. An immediate edit of a code suggestion might mean the model was wrong, or it might mean the developer wants to add their own customization. You have to define what each signal means for your specific product.
A useful exercise: for each behavioral signal you instrument, write out the two or three user intent interpretations that could produce it, and design the logging schema to distinguish between them where possible. If you can't distinguish, treat the signal as weak rather than strong.
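That exercise can be captured directly in code. This is a hypothetical worksheet structure, not a library API: each instrumented signal maps to its candidate intents and a flag for whether the logging schema can tell them apart.

```python
# Hypothetical worksheet: each signal maps to the candidate user
# intents that could produce it, plus whether your logging schema
# can distinguish them. Entries here are illustrative examples.
INTERPRETATIONS = {
    "immediate_edit": {
        "intents": ["model_was_wrong", "user_customization"],
        "distinguishable": False,  # timing alone can't separate these
    },
    "immediate_undo": {
        "intents": ["model_was_wrong", "misclick"],
        "distinguishable": True,   # e.g. misclicks are followed by a redo
    },
}

def signal_strength(signal: str) -> str:
    """Downgrade any signal whose competing interpretations
    can't be distinguished in the logs."""
    entry = INTERPRETATIONS[signal]
    if len(entry["intents"]) > 1 and not entry["distinguishable"]:
        return "weak"
    return "strong"
```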
Step 3: Sample strategically, not randomly
You don't need to collect labels on every session. The high-value cases are at the edges:
- Sessions where the user showed mixed signals (partial edit, then continued)
- High-retry sessions (3+ reformulations on the same underlying task)
- New interaction patterns that don't match your historical baseline
- Sessions from cohorts where your model is performing worst on other metrics
Sampling from the distribution tails gives you the cases most likely to expose model weaknesses. Random sampling gives you mostly confident successes, which add volume without improving your eval coverage.
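The tail-first sampling rule above can be sketched as a simple priority sort. The session fields (`retries`, `mixed_signals`, `cohort_pctile`) and the point values are assumptions for illustration; real priorities would come from your own baselines.

```python
def sample_tails(sessions: list[dict], n: int) -> list[dict]:
    """Rank sessions by edge-case priority and take the top n.
    Session keys (retries, mixed_signals, cohort_pctile) are
    illustrative; point values are arbitrary for the sketch."""
    def priority(s: dict) -> int:
        score = 0
        if s.get("retries", 0) >= 3:
            score += 2                      # high-retry sessions
        if s.get("mixed_signals"):
            score += 2                      # partial edit, then continued
        if s.get("cohort_pctile", 50) <= 10:
            score += 1                      # worst-performing cohort
        return score
    return sorted(sessions, key=priority, reverse=True)[:n]
```

Because the sort is stable, sessions with no edge-case markers sink to the bottom and never consume review budget until the interesting tails are exhausted.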
Step 4: Convert to eval examples
For each labeled session, extract the canonical (prompt, response, label) triple:
- The prompt is the user's final successful phrasing (after any retries)
- The response is what the model produced at that turn
- The label can be binary (accepted/rejected) or graded (no edit, minor edit, major edit, full replacement)
For retry cases, you now have something more valuable: a natural A/B pair. The model's first response was rejected; the reformulated query produced something the user accepted. That's a preference pair without any annotation — the behavioral sequence is the label.
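The extraction step can be sketched as follows. This assumes a simplified session representation, a list of `(prompt, response, accepted)` turns in order, where `accepted` comes from your behavioral labels rather than any annotation.

```python
def session_to_examples(turns: list[tuple[str, str, bool]]):
    """Convert an ordered session into (prompt, response, label) triples,
    plus preference pairs mined from retry runs: each rejected attempt
    paired against the response the user finally accepted."""
    triples = [(p, r, "accepted" if ok else "rejected")
               for p, r, ok in turns]

    pairs = []
    for i, (_, chosen, ok) in enumerate(turns):
        if ok:  # first accepted turn ends the retry run
            pairs = [{"chosen": chosen, "rejected": rej}
                     for _, rej, rej_ok in turns[:i] if not rej_ok]
            break
    return triples, pairs
```

A two-turn retry session yields one triple per turn plus one preference pair, with the behavioral sequence supplying the label exactly as described above.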
The Annotation Budget Math
The economic argument for behavioral signals becomes stark when you run the numbers on human annotation.
Human annotation costs roughly $0.08 per label at standard quality. LLM-based annotation costs around $0.002 per label. At equivalent cost, LLM annotation gives you 40× more labeled data — and at scale (roughly 25,000 examples), LLM-annotated datasets match human-annotated datasets in downstream model quality.
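The arithmetic is worth making explicit. Using the per-label figures from the text, a fixed budget buys 40× more LLM labels than human labels:

```python
# Per-label costs quoted in the text (USD).
HUMAN_COST = 0.08
LLM_COST = 0.002

def labels_for_budget(budget_usd: float, cost_per_label: float) -> int:
    """How many labels a fixed annotation budget buys."""
    return round(budget_usd / cost_per_label)

# A $2,000 budget: 25,000 human labels vs. 1,000,000 LLM labels,
# i.e. the 40x multiplier, and exactly the ~25,000-example scale
# at which LLM-annotated datasets reportedly match human ones.
```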
Behavioral signals cost almost nothing per label, because the user is doing the labeling by interacting with your product. The cost is infrastructure and pipeline engineering, not per-example labor. At any meaningful user scale — even thousands of monthly active users — you can generate labeled eval datasets faster than any manual annotation workflow could produce them.
The catch is noise. Implicit signals are frequent but noisy. A retry might mean the model failed, or it might mean the user changed their mind. A session abandonment might mean dissatisfaction, or it might mean the user got interrupted by a meeting. Building a clean eval dataset from behavioral signals requires filtering, aggregation, and some tolerance for label uncertainty. Research on repeated implicit feedback found that measuring uncertainty across multiple interaction events — rather than relying on any single behavioral event — substantially improves signal reliability.
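One simple way to aggregate repeated noisy events, offered here as an illustrative sketch rather than the method from the cited research, is to treat each event as a binary vote and maintain a Beta posterior over "this kind of output is good." One retry barely moves the estimate; five consistent retries move it a lot, and the variance tells you how much to trust the label.

```python
def aggregate_signal(events: list[str], prior_pos: float = 1.0,
                     prior_neg: float = 1.0) -> tuple[float, float]:
    """Beta-posterior aggregation of repeated binary behavioral events.
    Returns (posterior mean, posterior variance); variance is the
    label-uncertainty measure. Illustrative sketch only."""
    pos = prior_pos + sum(1 for e in events if e == "positive")
    neg = prior_neg + sum(1 for e in events if e == "negative")
    n = pos + neg
    mean = pos / n
    var = (pos * neg) / (n ** 2 * (n + 1))
    return mean, var
```

With a uniform prior, a single negative event yields a mean of 1/3 with high variance; five negatives push the mean toward zero while shrinking the variance, which is the "measure uncertainty across multiple events" idea in miniature.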
What Microsoft's Semantic Telemetry Gets Right
Microsoft Research built a framework they call semantic telemetry for understanding how users interact with AI systems at scale. The core insight is using LLMs to classify interaction patterns rather than just counting events.
Their pipeline is three steps:
- Generate a taxonomy: Feed a sample of conversations to an LLM and ask it to produce a set of classification categories — topics, task types, complexity levels.
- Classify at scale: Apply those categories to millions of anonymized interactions.
- Analyze distribution shifts: Watch how the distribution of topics and task complexity changes over time, across cohorts, or across model versions.
What makes this useful for eval is step 3. If users in your power-user cohort start shifting from "lookup and retrieval" tasks toward "synthesis and analysis" tasks, your eval dataset needs to reflect that shift. Behavioral telemetry makes that drift visible. Without it, your eval dataset stays static while the actual usage distribution evolves under it.
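A concrete way to watch step 3, not Microsoft's implementation but a common drift measure, is Jensen-Shannon divergence between the topic distributions of two traffic windows. A rising value between last month's and this month's classified traffic is the drift signal.

```python
import math

def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])
    between two topic distributions given as dicts of
    category -> probability."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a: dict) -> float:
        return sum(a.get(k, 0.0) * math.log2(a.get(k, 0.0) / m[k])
                   for k in keys if a.get(k, 0.0) > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

Identical distributions score 0, fully disjoint ones score 1; a cohort drifting from "lookup and retrieval" toward "synthesis and analysis" shows up as a steadily climbing value, a cue that your eval slices need rebalancing.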
The Silent Signals You're Already Ignoring
Some behavioral signals require no new instrumentation because they already exist in your interaction logs — they just haven't been extracted and used.
Ctrl+Z / Undo events: If your product involves accepting AI-generated content into a document or editor, undo events are already logged by the editor. An undo within seconds of accepting a suggestion is the clearest possible negative label you can collect without asking anyone anything.
Share or export events: If a user copies an AI-generated output to clipboard, shares a link, or exports content to another tool, that's a strong acceptance signal. It means they trusted the output enough to move it downstream.
Back-navigation from result pages: If a user clicks into an AI-generated result and immediately navigates back, that's a rejection signal — the same structure as a bounce in search, where the result didn't match the intent.
Follow-up question structure: A follow-up question that builds on the previous response ("now make it shorter" or "add an exception for edge case X") is a positive engagement signal. A follow-up that re-explains the original task ("no, I meant...") is a correction signal — the model misunderstood.
None of these require a rating prompt. They're already in your logs if you're logging interactions at the session level.
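Extracting the undo signal from an existing event log is a short scan. The event tuple shape and the five-second window are assumptions for the sketch; the window threshold is a product decision.

```python
UNDO_WINDOW_MS = 5_000  # "within seconds" -- threshold is a product choice

def extract_undo_labels(events: list[tuple[int, str, str]]) -> list[str]:
    """Scan an ordered event log for accept -> undo pairs within the
    window. Each event is (timestamp_ms, event_type, suggestion_id);
    the tuple shape is illustrative. Returns rejected suggestion IDs."""
    accepted_at: dict[str, int] = {}
    rejected = []
    for ts, kind, sid in events:
        if kind == "accept":
            accepted_at[sid] = ts
        elif kind == "undo" and sid in accepted_at:
            if ts - accepted_at[sid] <= UNDO_WINDOW_MS:
                rejected.append(sid)
    return rejected
```

An undo two seconds after an accept becomes a negative label; an undo a minute later does not, since by then the user may simply have changed direction.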
Closing the Loop: From Signals to Continuous Eval
The goal isn't just to collect labeled data — it's to build a feedback loop that continuously updates your understanding of model quality in production.
A practical closed-loop architecture looks like this:
- Instrument interaction events at the session level
- Label behavioral events using the signal taxonomy you've defined for your product
- Sample from the distribution tails to build targeted eval slices
- Evaluate model changes against those slices before deployment
- Monitor distribution shifts in behavioral signals as a leading indicator of quality regressions
The monitoring step is the one most teams skip. Behavioral signal distributions are early-warning systems. If your retry rate spikes after a model update, that's a quality regression signal — often faster than waiting for explicit user complaints or formal eval results to accumulate. Conversely, if session completion rates improve without any prompt change, something in the underlying model behavior shifted in a way your evals didn't predict.
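The spike check in the monitoring step can be as simple as a z-score against a trailing window. The threshold and window here are illustrative defaults, not recommendations.

```python
import statistics

def retry_rate_alert(history: list[float], current: float,
                     z_threshold: float = 3.0) -> bool:
    """Flag a retry-rate spike: compare the current daily rate to the
    mean/stdev of the trailing window. Threshold is illustrative."""
    mean = statistics.fmean(history)
    sd = statistics.pstdev(history) or 1e-9  # guard against flat history
    return (current - mean) / sd > z_threshold
```

A retry rate jumping from ~10% to 25% the day after a model update trips the alert immediately, typically well before explicit complaints or the next formal eval run would surface the regression.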
The AI systems that fail most dangerously aren't the ones that crash loudly. They're the ones that keep running, sounding confident, while quietly drifting wrong. Behavioral telemetry is how you see the drift before your users do.
What to Do This Week
You don't need to build the full pipeline immediately. Start with two things:
First, audit what you're already logging. Check whether your interaction logs contain session IDs, turn sequences, and any post-response user actions. If they do, you can start extracting behavioral labels from existing data without writing new instrumentation.
Second, pick one signal that's clearly interpretable in your product context and instrument it deliberately. For most products, query retry rate is the highest-signal, lowest-noise behavioral metric available. A user who rephrases and resubmits is telling you unambiguously that the first response didn't work. Start there.
The feedback loop exists whether you close it or not. Every user session is running a silent evaluation of your model. The only question is whether you're reading the results.
- https://arxiv.org/html/2507.23158v2
- https://arxiv.org/html/2307.14117v2
- https://www.microsoft.com/en-us/research/blog/semantic-telemetry-understanding-how-users-interact-with-ai-systems/
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://dev.to/mosiddi/stop-begging-for-feedback-why-silent-signals-are-the-future-of-ai-learning-40jp
- https://www.montecarlodata.com/blog-ai-telemetry/
- https://www.braintrust.dev/articles/best-ai-evaluation-tools-2026
- https://latitude.so/blog/real-time-feedback-techniques-for-llm-optimization
- https://arxiv.org/html/2506.11019v1
- https://arxiv.org/html/2505.02492
