
AI User Research: What Users Actually Need Before You Write the First Prompt

10 min read
Tian Pan
Software Engineer

Most teams decide they're building an AI feature, then ask users: "Would you want this?" Users say yes. The feature ships. Three months later, weekly active usage is at 12% and plateauing. The postmortem blames implementation or adoption, but the real failure happened before a single line of code was written — in the user research phase that felt thorough but was methodologically broken.

The core problem: users cannot accurately predict their preferences for capabilities they have never experienced. This isn't a minor wrinkle. A study on AI writing assistance found that systems designed from users' stated preferences achieved only 57.7% accuracy — actually underperforming naive baselines that ignored user-stated preferences entirely. You can run a user research sprint for weeks, collect extensive qualitative feedback, and still end up with a product nobody uses — not despite the research, but partly because of how it was conducted.

Why the Standard Playbook Breaks Down for AI

Traditional UX research assumes you can show a prototype and gather meaningful feedback. This assumption holds for deterministic software. It doesn't hold for AI features, for two reasons.

First, users have no reference point. If you show a mockup of an AI-generated summary feature, users respond to the mockup's visual design, not the feature's actual value. They can't evaluate what the AI output will actually be like — how consistent it is, when it fails, whether its failures are recoverable, whether it actually saves them time in their real workflow. Their feedback reflects imagination, not experience.

Second, preferences are constructed, not retrieved. Behavioral economists have established that people don't have fixed preferences sitting in storage, waiting to be read out in a survey. Preferences are formed at the moment of decision, shaped by context, framing, and recent experience. When you ask a user whether they'd want AI assistance in their code review workflow, they construct an answer on the spot. That constructed preference has weak predictive power over what they'll actually do when the feature ships.

The consequence: most traditional user research methods produce stated preferences. For AI features, stated preferences are unreliable inputs. Teams that base their product decisions primarily on stated preferences build the wrong things — or build the right things for the wrong use cases.

The Wizard-of-Oz Method: Validate Before You Build

The most powerful technique for early-stage AI research is the Wizard-of-Oz (WOZ) method: users interact with what appears to be an autonomous AI system but is actually controlled by a human operator behind the scenes.

The approach is older than modern AI. It was first documented in the 1970s for testing automated terminal interfaces. The principle: if you need to understand how users interact with an AI system, you need an AI system they can interact with — but that doesn't mean you need to actually build the AI. A human wizard, following a consistent set of rules, can simulate most AI behaviors well enough to generate valid behavioral data.

Here's what a WOZ setup looks like for a code review AI feature:

  • The wizard sits in a separate window or uses a communication channel hidden from the user
  • The user submits code changes "to the AI"
  • The wizard reads the submission and crafts a response following a defined behavior spec
  • The user receives the response and continues the workflow
  • The facilitator observes and notes where the user hesitates, misunderstands, or succeeds

What you learn from this is qualitatively different from what you learn from asking users about the feature. You learn which kinds of suggestions users trust and act on. You learn where the interaction breaks down. You learn whether users actually integrate the workflow into how they work, or whether they test it once and return to their old method. You get behavioral data rather than stated preferences.

The cost of a WOZ session is hours. The cost of building the wrong AI feature is months — plus the organizational weight of a failed initiative. Teams that run WOZ studies before committing to engineering can validate or invalidate core assumptions in a week, then build with real confidence rather than hopeful projections.

Behavioral Interviews: Surfacing Real Workflows Without Asking What Users Want

The second critical method is behavioral interviewing — but not the kind where you ask users what they want. The goal is to surface real, recurring workflow problems that users have already solved imperfectly, because those imperfect solutions reveal genuine unmet needs.

The structure is oriented around recent actual events rather than hypothetical scenarios:

Start with a specific instance. "Walk me through the last time you did a large code review." Not: "Tell me about your code review process in general." General descriptions smooth over friction. Specific instances reveal it.

Follow the friction. When users slow down in their description, hesitate, or mention workarounds, those are signals. "You mentioned you export it to a spreadsheet first — why?" Workarounds that users have built for themselves are among the most reliable indicators of real unmet needs. If someone has built a workaround, the pain was real enough to motivate a fix.

Map the full workflow, not just the target task. AI features rarely exist in isolation. Understanding what happens before and after the task you're optimizing reveals whether your AI is solving the right bottleneck. The bottleneck in a code review workflow might not be the review itself — it might be triaging which PRs need urgent attention, or communicating findings to authors who don't read comments thoroughly.

Ignore the stated preference at the end. When you wrap up a behavioral interview, users will often offer feature requests or praise. Treat these as weak signals. What matters is what you observed in their description of actual behavior. If they described a manual workaround that took 20 minutes, that's a real problem. If they didn't describe any friction but say they'd love an AI assistant, that's a hypothetical preference with limited predictive value.

The diagnostic question after every behavioral interview: "Did I observe any evidence of real, recurring friction that this person has taken action to address?" If the answer is no, the use case may not be real enough.
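That diagnostic can be made mechanical. A toy scoring rubric for interview notes, assuming each note is tagged with a signal type (the signal taxonomy here is invented for illustration; the point is that only observed behavior counts as evidence, never stated preference):

```python
# Invented signal taxonomy: "observed" tags come from watching described
# behavior; "stated" tags come from what the user says they would want.
OBSERVED = {"workaround_built", "manual_step_timed", "repeated_friction"}
STATED = {"feature_request", "enthusiasm", "hypothetical_yes"}

def use_case_is_real(notes: set[str]) -> bool:
    """True only if the interview surfaced behavioral evidence of friction."""
    return bool(notes & OBSERVED)  # stated preferences alone never qualify

use_case_is_real({"enthusiasm", "workaround_built"})       # behavioral evidence present
use_case_is_real({"feature_request", "hypothetical_yes"})  # stated only
```

The asymmetry is deliberate: enthusiasm plus a workaround passes, while any pile of stated signals without observed friction fails.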

Reading the Signals: Real Use Case vs. Demo That Dies

Enterprise AI adoption benchmarks show a stark bimodal distribution. Bottom-quartile deployments land at 12–18% daily active users and 25–35% weekly active users 90 days after launch. Top-quartile deployments achieve 82–88% weekly active users at the same mark. The technology stack is rarely the differentiator. The differentiator is how well teams understood the problem before building.
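For reference, the adoption figures above are weekly-active-user rates. A minimal sketch of how such a rate is computed from raw usage events (the event data below is made up):

```python
from datetime import date, timedelta

def weekly_active_rate(events, all_users, week_end):
    """Fraction of all_users with at least one event in the 7 days ending week_end.

    events: iterable of (user_id, event_date) pairs.
    """
    week_start = week_end - timedelta(days=6)
    active = {user for user, day in events if week_start <= day <= week_end}
    return len(active) / len(all_users)

events = [("ana", date(2024, 3, 4)), ("bo", date(2024, 3, 6)), ("ana", date(2024, 3, 7))]
rate = weekly_active_rate(events, {"ana", "bo", "cy", "di"}, date(2024, 3, 7))
print(rate)  # 0.5: 2 of 4 eligible users were active this week
```

Note the denominator: the benchmarks count all eligible users, not just those who ever tried the feature, which is why a well-demoed feature can still land in the bottom quartile.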

The behavioral patterns that distinguish real use cases from demos that die at 12% WAU are the ones the methods above surface: recurring friction the user has already acted on, self-built workarounds, and evidence that the workflow actually changed. Read those signals before you write the first prompt.