AI User Research: What Users Actually Need Before You Write the First Prompt

10 min read
Tian Pan
Software Engineer

Most teams decide they're building an AI feature, then ask users: "Would you want this?" Users say yes. The feature ships. Three months later, weekly active usage is at 12% and plateauing. The postmortem blames implementation or adoption, but the real failure happened before a single line of code was written — in the user research phase that felt thorough but was methodologically broken.

The core problem: users cannot accurately predict their preferences for capabilities they have never experienced. This isn't a minor wrinkle. A study on AI writing assistance found that systems designed from users' stated preferences achieved only 57.7% accuracy — actually underperforming naive baselines that ignored user-stated preferences entirely. You can do a user research sprint that runs for weeks, collect extensive qualitative feedback, and end up with a product nobody uses — not despite the research, but partly because of how it was conducted.

Why the Standard Playbook Breaks Down for AI

Traditional UX research assumes you can show a prototype and gather meaningful feedback. This assumption holds for deterministic software. It doesn't hold for AI features, for two reasons.

First, users have no reference point. If you show a mockup of an AI-generated summary feature, users respond to the mockup's visual design, not the feature's actual value. They can't evaluate what the AI output will actually be like — how consistent it is, when it fails, whether its failures are recoverable, whether it actually saves them time in their real workflow. Their feedback reflects imagination, not experience.

Second, preferences are constructed, not retrieved. Behavioral economists have established that people don't have fixed preferences sitting in storage, waiting to be read out in a survey. Preferences are formed at the moment of decision, shaped by context, framing, and recent experience. When you ask a user whether they'd want AI assistance in their code review workflow, they construct an answer on the spot. That constructed preference has weak predictive power over what they'll actually do when the feature ships.

The consequence: most traditional user research methods produce stated preferences. For AI features, stated preferences are unreliable inputs. Teams that base their product decisions primarily on stated preferences build the wrong things — or build the right things for the wrong use cases.

The Wizard-of-Oz Method: Validate Before You Build

The most powerful technique for early-stage AI research is the Wizard-of-Oz (WOZ) method: users interact with what appears to be an autonomous AI system but is actually controlled by a human operator behind the scenes.

The approach is older than modern AI. It was first documented in the 1970s for testing automated terminal interfaces. The principle: if you need to understand how users interact with an AI system, you need an AI system they can interact with — but that doesn't mean you need to actually build the AI. A human wizard, following a consistent set of rules, can simulate most AI behaviors well enough to generate valid behavioral data.

Here's what a WOZ setup looks like for a code review AI feature:

  • The wizard sits in a separate window or uses a communication channel hidden from the user
  • The user submits code changes "to the AI"
  • The wizard reads the submission and crafts a response following a defined behavior spec
  • The user receives the response and continues the workflow
  • The facilitator observes and notes where the user hesitates, misunderstands, or succeeds

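The relay mechanics are simple enough to sketch in a few lines. This is an illustrative harness, not a real framework: the class, field names, and logged events are all assumptions, and in practice the wizard would sit behind a chat channel rather than a direct method call.

```python
import time
from dataclasses import dataclass, field

@dataclass
class WozSession:
    """Relay between the user and a hidden human wizard, logging every exchange.

    There is no AI here: the user-facing side calls submit(), the wizard side
    calls respond(), and the log captures the behavioral data (inputs, replies,
    turnaround time) that the facilitator analyzes afterward.
    """
    log: list = field(default_factory=list)

    def submit(self, user_input: str) -> int:
        """User sends code 'to the AI'; returns a ticket id for the wizard."""
        self.log.append({"input": user_input, "sent_at": time.time(),
                         "response": None, "answered_at": None})
        return len(self.log) - 1

    def respond(self, ticket: int, wizard_reply: str) -> None:
        """Wizard crafts a reply following the defined behavior spec."""
        self.log[ticket]["response"] = wizard_reply
        self.log[ticket]["answered_at"] = time.time()

    def turnaround(self, ticket: int) -> float:
        """Seconds between submission and reply; long gaps are a signal too."""
        entry = self.log[ticket]
        return entry["answered_at"] - entry["sent_at"]

session = WozSession()
ticket = session.submit("def add(a, b): return a + b  # please review")
session.respond(ticket, "Consider adding type hints and a docstring.")
```

The log is the deliverable: after a handful of sessions, you can review where users hesitated, what they resubmitted, and which wizard replies they acted on.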
What you learn from this is qualitatively different from what you learn from asking users about the feature. You learn which kinds of suggestions users trust and act on. You learn where the interaction breaks down. You learn whether users actually integrate the workflow into how they work, or whether they test it once and return to their old method. You get behavioral data rather than stated preferences.

The cost of a WOZ session is hours. The cost of building the wrong AI feature is months — plus the organizational weight of a failed initiative. Teams that run WOZ studies before committing to engineering can validate or invalidate core assumptions in a week, then build with real confidence rather than hopeful projections.

Behavioral Interviews: Surfacing Real Workflows Without Asking What Users Want

The second critical method is behavioral interviewing — but not the kind where you ask users what they want. The goal is to surface real, recurring workflow problems that users have already solved imperfectly, because those imperfect solutions reveal genuine unmet needs.

The structure is oriented around recent actual events rather than hypothetical scenarios:

Start with a specific instance. "Walk me through the last time you did a large code review." Not: "Tell me about your code review process in general." General descriptions smooth over friction. Specific instances reveal it.

Follow the friction. When users slow down in their description, hesitate, or mention workarounds, those are signals. "You mentioned you export it to a spreadsheet first — why?" Workarounds that users have built for themselves are among the most reliable indicators of real unmet needs. If someone has built a workaround, the pain was real enough to motivate a fix.

Map the full workflow, not just the target task. AI features rarely exist in isolation. Understanding what happens before and after the task you're optimizing reveals whether your AI is solving the right bottleneck. The bottleneck in a code review workflow might not be the review itself — it might be triaging which PRs need urgent attention, or communicating findings to authors who don't read comments thoroughly.

Ignore the stated preference at the end. When you wrap up a behavioral interview, users will often offer feature requests or praise. Treat these as weak signals. What matters is what you observed in their description of actual behavior. If they described a manual workaround that took 20 minutes, that's a real problem. If they didn't describe any friction but say they'd love an AI assistant, that's a hypothetical preference with limited predictive value.

The diagnostic question after every behavioral interview: "Did I observe any evidence of real, recurring friction that this person has taken action to address?" If the answer is no, the use case may not be real enough.

Reading the Signals: Real Use Case vs. Demo That Dies

Enterprise AI adoption benchmarks show a stark bimodal distribution. Bottom-quartile deployments land at 12–18% daily active users and 25–35% weekly active users 90 days after launch. Top-quartile deployments achieve 82–88% weekly active users at the same mark. The technology stack is rarely the differentiator. The differentiator is how well teams understood the problem before building.

There are behavioral patterns that distinguish real use cases from demos that will die at 12% WAU:

Signs of a real use case:

  • Users integrate the AI into existing workflows without being reminded or trained
  • Usage doesn't spike at announcement and then plateau; it grows gradually as users find more applications
  • Users encounter AI errors and continue using the product, because the errors don't derail their workflow
  • Users ask for expansions to adjacent tasks (signal of integration, not just experimentation)
  • Power-user clusters emerge naturally without promotional effort

Signs a demo will die:

  • Usage spikes at launch — driven by novelty — then drops 60–70% within two weeks
  • Adoption plateaus at 10–15% of the target population and holds there
  • Users describe experimenting with the AI but return to their original workflow for real work
  • Questions cluster around "what can this do?" rather than "how do I do X with this?"
  • The use case requires users to fundamentally change their workflow to accommodate the AI, rather than the AI accommodating the workflow

The last point is the most reliable signal. When users have to change how they work to use the AI, the AI is solving a problem the team identified, not one users actually have. When the AI fits into how users already work and removes friction from that existing workflow, you have a real use case.
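Several of these signals can be read mechanically from usage telemetry. A heuristic sketch, where the thresholds are illustrative approximations of the benchmarks above rather than fixed rules, and should be calibrated to your own population:

```python
def classify_adoption(weekly_active_pct: list[float]) -> str:
    """Heuristic read of weekly active usage, as percent of the target
    population, with index 0 = launch week."""
    launch, latest = weekly_active_pct[0], weekly_active_pct[-1]
    # Novelty spike: usage falls 60%+ from launch within two weeks
    novelty_crash = (len(weekly_active_pct) > 2
                     and weekly_active_pct[2] < 0.4 * launch)
    # Flat in the 10-15% band over the last month: demo-that-dies territory
    plateaued_low = (len(weekly_active_pct) >= 4 and latest <= 15
                     and abs(latest - weekly_active_pct[-4]) < 2)
    if novelty_crash or plateaued_low:
        return "demo likely dying"
    if latest > launch and latest >= 50:
        return "real use case"
    return "inconclusive"

print(classify_adoption([40, 35, 14, 12, 12, 12]))  # spike, then crash
print(classify_adoption([10, 18, 30, 45, 58, 66]))  # gradual growth
```

A classifier like this is no substitute for the behavioral interviews; it is a cheap way to flag which deployments deserve a closer qualitative look.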

Running a Prompt Prototype Test Before Committing to Engineering

Between Wizard-of-Oz testing and full engineering commitment, there's a useful intermediate step: the prompt prototype test. Before writing production code, you can validate an AI feature's core value by manually running the AI workflow against real user inputs in a 30-minute session.

For a code review summarization feature, this looks like:

  1. Take five real PRs from your team's recent history
  2. Run them through a carefully constructed prompt against a frontier model
  3. Have two or three engineers evaluate the outputs independently against the actual code
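The harness for these three steps fits in a few dozen lines. In this sketch, `call_model` is a placeholder for whichever frontier-model API you use (stubbed here so the example runs offline), and raters are plain callables returning 1-5 usefulness scores; the aggregation threshold is an illustrative assumption:

```python
import statistics

def call_model(prompt: str, diff: str) -> str:
    """Placeholder for the real model call (an HTTP request to whichever
    model you're evaluating); stubbed so the harness runs without a network."""
    return f"Summary of a {len(diff.splitlines())}-line diff."

def run_prototype(prompt: str, diffs: list[str], raters: list) -> dict:
    """Run each real PR diff through the prompt, collect independent 1-5
    usefulness ratings from each engineer, and aggregate a rough verdict."""
    results = []
    for diff in diffs:
        output = call_model(prompt, diff)
        scores = [rate(output, diff) for rate in raters]
        results.append({"output": output, "scores": scores,
                        "mean": statistics.mean(scores)})
    overall = statistics.mean(r["mean"] for r in results)
    return {"results": results, "overall": overall,
            "verdict": "promising" if overall >= 3.5 else "not yet"}
```

The point of keeping raters independent is that agreement between engineers is itself data: high variance in scores usually means the use case is underspecified, not that the model is weak.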

This doesn't test UX, integration, latency, or scale. It tests one thing: does the AI actually produce useful output for the use case you're targeting? That question deserves an answer before any infrastructure is built.

The outputs of a prompt prototype test are:

  • A judgment on whether the core capability is achievable (with current models, at current costs)
  • Early evidence about failure modes — what kinds of inputs produce bad outputs
  • A rough calibration of how much prompt engineering and output curation the production system will require

Teams that run prompt prototype tests before committing to engineering build more accurate estimates and avoid investing months in features that are technically infeasible or economically unviable.

The Organizational Mistake: Treating User Research as a One-Time Phase

The deeper failure mode in AI user research isn't methodological — it's organizational. Teams treat user research as a phase that happens before building, then stops. For AI features, this is particularly costly because the product's behavior changes as models are updated, as edge cases emerge in production, and as the user population grows beyond the early adopters who helped design it.

High-performing AI teams treat user understanding as continuous infrastructure:

  • Production telemetry that captures failure patterns. When users stop using a feature mid-task, when they revert a suggestion, when they explicitly reject output — these are signals. Instrumenting for them gives you ongoing behavioral data that's more reliable than periodic research sprints.
  • Regular structured review of support and feedback channels. Users who contact support have passed a high threshold to report their problem. Analyzing that feedback systematically surfaces recurring failure modes that aggregate data might obscure.
  • Periodic behavioral interviews with representative users, not just power users. Teams tend to recruit research participants from their most engaged users, which biases findings toward the use cases of people who've already integrated the product. Deliberately including low-engagement users reveals why adoption is stuck at a low plateau.
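The first item, failure-pattern telemetry, amounts to a small aggregation over instrumented events. A sketch follows; the event names and schema are assumptions for illustration, not a standard, and a production pipeline would run this over an events table rather than an in-memory list:

```python
from collections import Counter

# Event names are illustrative; use whatever your telemetry pipeline emits.
FAILURE_EVENTS = {"suggestion_reverted", "output_rejected", "task_abandoned"}

def failure_summary(events: list[dict]) -> dict:
    """Aggregate per-feature failure signals from raw telemetry events,
    where each event looks like {"feature": "...", "type": "..."}."""
    totals, failures = Counter(), Counter()
    for event in events:
        totals[event["feature"]] += 1
        if event["type"] in FAILURE_EVENTS:
            failures[event["feature"]] += 1
    return {feat: {"events": totals[feat],
                   "failures": failures[feat],
                   "failure_rate": failures[feat] / totals[feat]}
            for feat in totals}
```

Tracking the failure rate per feature over time turns "users seem frustrated" into a trend line you can act on between research sprints.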

The teams that achieve 82–88% weekly active usage aren't just better at building AI. They're better at continuous understanding of how real users interact with what they've built, and they close the loop between that understanding and product iteration faster.

Conclusion

The practical implication is uncomfortable for teams that like to move fast: the research that matters for AI features isn't faster or cheaper than traditional user research. It's different. It relies on behavioral observation and Wizard-of-Oz simulation rather than stated preferences and surveys. It requires patience with ambiguity about whether a use case is real before committing to engineering.

The payoff is equally uncomfortable to ignore: 42% of companies abandoned most of their AI initiatives in 2025, and 46% of AI proofs-of-concept never reached production. Most of those failures were discovery failures — teams that built something technically impressive for a problem that turned out not to be real, recurring, or high-friction enough to sustain usage.

Before you write the first prompt, answer the question you can't afford to skip: did I observe real, recurring friction that users have already taken action to address? If yes, build. If not, go back to the interviews.
