
The Conversation Designer's Hidden Role in AI Product Quality

10 min read
Tian Pan
Software Engineer

Most engineering teams treat system prompts as configuration files — technical strings to be iterated on quickly, stored in environment variables, and deployed with the same ceremony as changing a timeout value. The system prompt gets an inline comment. The error messages get none. The capability disclosure is whatever the PM typed into the Notion doc on launch day.

This is the root cause of an entire class of AI product failures that don't show up in your eval suite. The model answers the question. The latency is fine. The JSON validates. But users stop trusting the product after three sessions, and the weekly active usage curve never recovers.

The missing discipline is conversation design. And it shapes output quality in ways that most engineering instrumentation is architecturally blind to.

The Prompt Is Product Copy, Whether You Treat It That Way or Not

When you write "You are a helpful assistant. Answer user questions accurately and concisely," you have made a series of product decisions:

  • What persona does the product have? (Generic.)
  • How should it handle ambiguity? (It won't — you didn't say.)
  • What should it do when it doesn't know something? (Undefined.)
  • What's the boundary between this product and everything else it might be asked to do? (Not established.)

These omissions don't mean no decisions were made. The model will fill the gaps with something — usually a blend of training defaults that produces behavior inconsistent with your actual product goal. You've written configuration. The model is executing product copy you didn't write.
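
To make this concrete, here is a sketch of the same system prompt written twice: once as the generic string above, and once with each of the four decisions stated explicitly. The product, domain, and wording are invented for illustration, not a recommended template.

```python
# Illustrative only: two hypothetical system prompts for the same feature.
# The first leaves every product decision to the model's training defaults;
# the second answers each of the four questions above explicitly.

UNDERSPECIFIED_PROMPT = (
    "You are a helpful assistant. Answer user questions accurately and concisely."
)

SPECIFIED_PROMPT = """\
You are the billing assistant for Acme's customer portal.

Persona: plain and direct, no marketing language. Address the user as "you".
Ambiguity: if a request could mean more than one thing, ask one clarifying
question before answering. Do not guess.
Unknowns: if the answer is not in the provided account context, say so and
point the user to Settings > Billing in the portal. Never estimate amounts.
Scope: you handle invoices, payment methods, and plan changes only. For
anything else, say it is out of scope and link to support.
"""
```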

Research bears this out in measurable terms. Variations in prompt phrasing, formatting, and vocabulary produce accuracy swings of up to 76 percentage points on structured tasks. The difference between "list the top three options" and "provide three recommendations, ordered by suitability" isn't stylistic — it changes how the model evaluates relevance and structures its reasoning. Ambiguity doesn't produce neutral outputs. It produces outputs weighted toward whatever the training distribution most commonly saw for that surface form.

What Conversation Designers Actually Do That Engineers Don't

Conversation design as a formal discipline predates LLMs — it emerged from voice assistants, IVR systems, and chatbot product design in the 2010s. Its core concern is the communicative contract between a system and a user: what the system can do, how it expresses uncertainty, how it recovers from failure, and how it maintains the user's trust when things go wrong.

When applied to LLM systems, this breaks into four concrete problem areas:

Persona and tone calibration. The framing of a system prompt establishes a model's default register — the level of formality, the vocabulary complexity, the degree of hedging. A financial planning assistant that uses the same register as a casual creative writing tool produces cognitive dissonance users can't articulate but definitely feel. They say the product feels "off" in user research. The engineers look at the evals and see nothing wrong.

Instruction hierarchy and conflict resolution. System prompts for production features routinely accumulate contradictions. "Be concise" and "always provide full context" sit in the same prompt. The model's behavior when resolving these conflicts is not random — it's shaped by instruction ordering, phrasing, and implicit priority signals. Conversation designers know to audit these conflicts explicitly. Engineers typically discover them when a user files a bug report.
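
Even a crude automated pass can surface the most obvious of these contradictions before a user does. The conflict pairs below are invented examples; a real audit is still a designer reading the prompt end to end.

```python
# A crude, illustrative audit for contradictory directives in a system prompt.
# The conflict pairs are made-up examples of directives that pull in opposite directions.

CONFLICT_PAIRS = [
    ("be concise", "full context"),
    ("never speculate", "always offer a best guess"),
    ("use plain language", "use precise legal terminology"),
]

def find_conflicts(prompt: str) -> list[tuple[str, str]]:
    """Return directive pairs where both phrases appear in the same prompt."""
    text = prompt.lower()
    return [(a, b) for a, b in CONFLICT_PAIRS if a in text and b in text]

prompt = (
    "Be concise. Always provide full context for every recommendation. "
    "Never speculate about account balances."
)
print(find_conflicts(prompt))  # [('be concise', 'full context')]
```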

Failure and edge-case scripting. When a model can't complete a request — because it's out of scope, because the information isn't available, because the user input is malformed — the failure response is a product decision, not a graceful degradation. Generic fallbacks ("I'm sorry, I can't help with that") damage trust. Specific, actionable responses ("I can't look up your account balance, but you can find it in the portal at Settings > Account") preserve trust and often increase the chances the user achieves their goal anyway.
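
In code, this often amounts to nothing more than a table of known failure modes mapped to specific copy, with the generic apology kept only as a last resort. The failure types and wording here are illustrative:

```python
# Illustrative failure scripting: each known failure mode gets a specific,
# actionable response instead of one generic apology. Types and copy are invented.

FAILURE_RESPONSES = {
    "out_of_scope": (
        "I can't help with payroll questions, but the HR portal covers them "
        "under People > Payroll."
    ),
    "data_unavailable": (
        "I can't look up your account balance, but you can find it in the "
        "portal at Settings > Account."
    ),
    "malformed_input": (
        "I couldn't read that date. Could you give it as YYYY-MM-DD?"
    ),
}

GENERIC_FALLBACK = "I'm sorry, I can't help with that."  # last resort only

def failure_message(failure_type: str) -> str:
    """Return scripted copy for a known failure mode, or the generic fallback."""
    return FAILURE_RESPONSES.get(failure_type, GENERIC_FALLBACK)
```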

Capability disclosures. Users systematically over-estimate what AI systems can do after an impressive demo, then calibrate sharply downward after the first failure. Proactive, accurate capability signaling — embedded in the product's communication patterns, not buried in an FAQ — keeps expectations calibrated and prevents the trust collapse that follows the first unexpected miss.
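
A sketch of what proactive signaling can look like in practice, with an invented feature list: the disclosure lives in the assistant's opening turn rather than in documentation.

```python
# Illustrative capability disclosure surfaced in the product itself
# (here, the assistant's first message), not buried in an FAQ.

CAPABILITY_DISCLOSURE = (
    "I can summarize your invoices, update payment methods, and explain plan "
    "changes. I can't issue refunds or see transactions older than 12 months; "
    "for those, contact support."
)

def opening_message(user_name: str) -> str:
    """Open the session with an accurate statement of what the assistant can and can't do."""
    return f"Hi {user_name}. {CAPABILITY_DISCLOSURE} What would you like to do?"
```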

The Instrumentation Gap

Here's the specific problem: engineering teams measure what's easy to measure. Token cost, latency, error rates, eval pass rates. These are real and important metrics. They're also completely silent on the dimensions conversation design affects.

You cannot measure "the user feels subtly less confident in this product" with a JSON schema validator. You cannot catch "this error message made the user blame themselves for the model's failure" in a unit test. You cannot see "this capability disclosure is causing users to over-trust and then over-correct" in your latency dashboard.

Companies that resource conversation design separately from engineering see measurably higher AI feature adoption rates. This isn't because conversation designers have magical intuitions — it's because they instrument for different signals. Session depth. Return rate after the first failure. Conversation repair rate (how often users rephrase after getting an unsatisfying response). Edit rate on AI-generated content. These are the metrics that predict whether a feature becomes a habit or a disappointment.
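
As an illustration, conversation repair rate can be computed from little more than an ordered list of turns, assuming some upstream step has already judged which user turns are rephrasings. The Turn schema here is hypothetical:

```python
# Rough sketch of computing conversation repair rate: how often a user rephrases
# right after a response. The schema is hypothetical; real pipelines would detect
# rephrasings with embeddings or explicit feedback rather than a precomputed flag.

from dataclasses import dataclass

@dataclass
class Turn:
    role: str            # "user" or "assistant"
    was_rephrase: bool   # user turn judged to restate the previous request

def repair_rate(session: list[Turn]) -> float:
    """Fraction of assistant turns that were immediately followed by a user rephrase."""
    assistant_turns = 0
    repaired = 0
    for prev, nxt in zip(session, session[1:]):
        if prev.role == "assistant":
            assistant_turns += 1
            if nxt.role == "user" and nxt.was_rephrase:
                repaired += 1
    return repaired / assistant_turns if assistant_turns else 0.0

session = [
    Turn("user", False), Turn("assistant", False),
    Turn("user", True),  Turn("assistant", False),
    Turn("user", False),
]
print(repair_rate(session))  # 0.5: one of two answered turns triggered a rephrase
```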

A/B Testing Prompt Language Rigorously
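
A minimal version of such a test treats a behavioral metric, say the repair-free session rate from the previous section, as the success outcome for each prompt variant and checks whether the difference clears a standard significance threshold before either variant ships. The variant counts below are invented:

```python
# Minimal sketch of an A/B comparison between two prompt variants.
# "Success" is a session with no conversation repair; counts are illustrative.

from statistics import NormalDist

def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in success proportions between variants."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_a - p_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Variant A: current prompt. Variant B: rewritten by a conversation designer.
p_value = two_proportion_p_value(success_a=412, n_a=600, success_b=463, n_b=600)
print(f"p = {p_value:.4f}")
```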
