The Unglamorous Work Behind Rapidly Improving AI Products
Most AI teams hit the same wall six weeks after launch. Initial demos were impressive, the prototype shipped on time, and early users said nice things. Then the gap between "good enough to show" and "good enough to keep" becomes unavoidable. The team scrambles — tweaking prompts, swapping models, adding guardrails — and the product barely moves.
The teams that actually improve quickly share one counterintuitive habit: they spend less time on architecture and more time staring at data. Not dashboards. Not aggregate metrics. The raw, ugly, individual failures that live inside conversation logs.
This is a field guide to the practices that separate fast-moving AI teams from ones that stay stuck.
Error Analysis Is the Highest-ROI Activity You're Probably Skipping
When an AI product underperforms, the instinct is to reach for solutions — a better model, a new retrieval strategy, more examples in the prompt. This instinct is almost always premature. Before you can fix the right problem, you need to know what the problem actually is.
Error analysis done right is not glamorous. It looks like sitting with 100-200 conversation traces, writing freeform notes about what went wrong, and then clustering those notes into a taxonomy.
Two approaches exist, and most teams use the wrong one by default:
Top-down analysis starts with predetermined categories — "hallucination," "irrelevant response," "formatting error" — and asks reviewers to label each failure. It's fast to set up and produces clean spreadsheets. It's also terrible at finding the problems that actually matter, because domain-specific failures rarely map onto generic categories you invented before looking at data.
Bottom-up analysis starts with open observation. You read failures and write down what you notice in plain language, without forcing it into a pre-existing bucket. Only after accumulating notes across many examples do you cluster them into categories. The taxonomy emerges from the data rather than being imposed on it.
The payoff is asymmetric. In most AI products, 3-4 failure categories account for 60% or more of all problems. Bottom-up analysis finds these reliably; top-down analysis routinely misses them because the real failure modes weren't on your list to begin with. Aim for theoretical saturation: keep reviewing until no genuinely new failure types appear. For most products, this happens around 150-200 examples.
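Once notes are clustered into emergent categories, the asymmetry is easy to see with a simple tally. A minimal sketch, assuming the open-coding pass is already done; the category names and counts here are hypothetical:

```python
from collections import Counter

# Hypothetical open-coded notes: each reviewed trace got a freeform note
# that was later normalized into an emergent category during clustering.
coded_notes = (
    ["wrong_policy_quoted"] * 48
    + ["ignored_user_constraint"] * 33
    + ["stale_inventory_data"] * 21
    + ["tone_too_formal"] * 9
    + ["truncated_answer"] * 5
)

counts = Counter(coded_notes)
total = sum(counts.values())

# Print each category's share and the running cumulative share, to see
# how few categories cover most failures.
cumulative = 0.0
for category, n in counts.most_common():
    share = n / total
    cumulative += share
    print(f"{category:<25} {share:5.1%}  cumulative {cumulative:5.1%}")
```

In this illustrative dataset, the top three categories alone cover well over 60% of failures, which is the pattern the bottom-up pass is designed to surface.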
The Data Viewer Is Infrastructure, Not a Nice-to-Have
Iteration speed is directly limited by how painful it is to review AI outputs. Most teams tolerate enormous friction here — hopping between log viewers, querying databases, opening multiple tabs to reconstruct a single conversation. Each extra click is a tiny tax on velocity, and these taxes compound.
The highest-leverage investment for an early AI team is often a custom data viewer: a lightweight internal tool that displays everything needed to evaluate a single interaction in one place.
The requirements are minimal but specific:
- Full context visible without navigation (the prompt, the retrieved documents, the conversation history, the model output)
- One-click feedback capture — correct, incorrect, or flagged
- Open text field for qualitative notes
- Keyboard shortcuts for power users
Generic observability platforms are good for infrastructure visibility. They're poor substitutes for a domain-specific tool designed around the shape of your data and the judgment calls your reviewers need to make. A simple custom viewer built in a day routinely outperforms an expensive platform because it removes the friction that slows down human review.
The test: if a domain expert (not an engineer) can load the tool, review 50 conversations in an hour, and capture structured feedback without asking anyone for help, it's working.
Domain Experts Should Write Prompts
One of the most persistent bottlenecks in AI product development is the assumption that prompts are engineering artifacts. They're not. A prompt is a specification — it describes what good behavior looks like, what edge cases matter, and what failure modes to avoid. The people who know this best are domain experts, not engineers.
The barrier is language. Engineers unwittingly gatekeep the process by using jargon that non-engineers don't recognize:
- "RAG" instead of "making sure the model has the right context before answering"
- "prompt injection" instead of "users trying to trick the AI into ignoring its rules"
- "hallucination" instead of "the model sometimes confidently makes things up"
Once the language is accessible, domain experts can contribute directly — and they tend to catch failure modes that engineers would never notice, because they understand what correct behavior actually looks like in context.
The structural fix is building what might be called an integrated prompt environment: an admin version of your actual product UI where domain experts can edit prompts and immediately see how changes affect real outputs, within the context they care about. Not a standalone playground, not a Jupyter notebook. The actual interface, with an edit layer on top.
This does more than accelerate iteration. It creates a feedback loop between the people who know what "good" looks like and the mechanism that defines good behavior.
Synthetic Data Solves the Cold-Start Problem
The standard objection to evaluation before launch is circular: "We don't have data yet, so we can't evaluate." This is mostly wrong.
Realistic synthetic data can be generated before a single real user interaction occurs. The key is grounding it in actual system constraints — real database schemas, actual business rules, specific regulatory requirements — rather than asking an LLM to invent plausible-sounding conversations from scratch.
A useful framework for generating synthetic test data works across three dimensions:
- Capabilities: What core tasks should the system perform? (Search, summarize, schedule, recommend)
- Scenarios: What situations arise for each capability? (Exact match found, multiple matches, nothing found, ambiguous input)
- Personas: What user types will interact with the system? (Power users, first-timers, users with unusual goals)
Generate user inputs, not expected outputs. The point is to create situations that stress-test the system, not to pre-define correct answers that might constrain how you evaluate real responses later.
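The three dimensions combine into a coverage grid, and each cell becomes an instruction for an LLM to generate one user input. A minimal sketch, with hypothetical dimension values standing in for ones grounded in your actual system:

```python
from itertools import product

# Hypothetical dimension values; in practice these come from real schemas,
# business rules, and observed user types.
capabilities = ["search inventory", "summarize order history"]
scenarios = ["an exact match exists", "multiple matches exist", "nothing is found"]
personas = ["power user", "first-time user"]

def make_generation_prompts():
    """Yield one LLM instruction per (capability, scenario, persona) cell.

    Each instruction asks for a user *input* only, never an expected
    output, so evaluation of real responses stays unconstrained.
    """
    for cap, scen, pers in product(capabilities, scenarios, personas):
        yield (
            f"Write one realistic user message from a {pers} trying to "
            f"{cap}, in a situation where {scen}. Output only the message."
        )

prompts = list(make_generation_prompts())
print(len(prompts))  # 2 capabilities * 3 scenarios * 2 personas = 12 cells
```

Feeding each cell's instruction to a model, along with grounding context like schemas and rules, yields a diverse synthetic test set before any real user shows up.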
Modern LLMs are surprisingly effective at generating diverse, realistic user prompts when given sufficient grounding context. Teams that build evaluation infrastructure before launch arrive at launch with measurement capabilities already in place — which is when iteration speed matters most.
Binary Evaluation Over Numerical Scores
The intuitive approach to measuring AI output quality is a rating scale: 1-5, or 1-10, with higher meaning better. Rating scales feel precise. They're usually not.
Numerical ratings require evaluators to make implicit, often inconsistent decisions about where the lines fall. A response that one person rates a 3, another rates a 4. And as criteria evolve — which they always do as you observe more outputs — recalibrating historical ratings becomes a second problem layered on top of the original measurement challenge.
Binary evaluation (pass/fail) removes this ambiguity. It forces a single question: does this output meet the bar or not? This is harder to answer for borderline cases, which turns out to be a feature rather than a bug. Borderline cases reveal where criteria are underspecified, and resolving that ambiguity makes the evaluation system more useful.
The critical addition to binary evaluation is the critique — a written explanation of why something passed or failed. These critiques serve two purposes simultaneously:
- They make the evaluation auditable and useful for human review
- They function as few-shot examples that measurably improve the performance of LLM-based judges on similar inputs
Teams that pair binary decisions with written critiques and then track human-LLM agreement rates (targeting above 90%) end up with evaluation infrastructure that scales. The human effort shifts from labeling every output to calibrating the system — a much better use of expert time.
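Because the verdicts are binary, the calibration metric is just the fraction of outputs where human and judge agree. A minimal sketch with hypothetical verdict lists:

```python
# Hypothetical paired verdicts on the same ten outputs: True = pass, False = fail.
human_verdicts = [True, True, False, True, False, True, True, False, True, True]
judge_verdicts = [True, True, False, True, True, True, True, False, True, True]

def agreement_rate(human, judge):
    """Fraction of outputs where the human and the LLM judge agree."""
    if len(human) != len(judge):
        raise ValueError("verdict lists must be the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

rate = agreement_rate(human_verdicts, judge_verdicts)
print(f"human-LLM agreement: {rate:.0%}")  # agree on 9 of 10 -> 90%
```

When the rate dips below the ~90% target, the fix is usually more critique examples for the judge, not more human labeling of every output.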
Structure Roadmaps Around Experiments, Not Features
AI product roadmaps built around feature delivery have a structural problem: feasibility is uncertain in a way it usually isn't for traditional software. You don't know if a capability is achievable until you've run experiments. Committing to timelines before running those experiments means committing to the wrong things.
The alternative is organizing roadmaps around experiments with explicit decision points.
A capability doesn't go from "idea" to "shipped" in one step. Progress is gradual:
- Can the system respond at all?
- Can it execute without errors?
- Can it return relevant results?
- Does it match what the user actually wanted?
- Does it optimally solve the problem?
Mapping roadmap items to stages on this funnel, rather than binary done/not-done, gives a more accurate picture of where work actually stands. Timeboxed exploration phases — two weeks for data feasibility, a month for technical feasibility, six weeks for a prototype — with explicit go/no-go decisions at each stage prevent the open-ended drift that kills AI projects.
The cultural corollary: normalize talking about what didn't work. Teams that regularly share failed experiments learn faster than teams that only document successes, because failure data contains information about the problem space that success data doesn't.
The metric that predicts improvement speed better than almost anything else is the number of complete experiments run per unit time. More experiments — designed, executed, evaluated, and learned from — means faster improvement, almost regardless of what else is happening.
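Tracking that metric requires only a small experiment log. A minimal sketch, assuming hypothetical field names and a simple per-week definition of velocity:

```python
from dataclasses import dataclass

# Funnel stages from the list above, ordered from weakest to strongest signal.
STAGES = ["responds", "no errors", "relevant", "matches intent", "optimal"]

@dataclass
class Experiment:
    name: str
    week: int            # week the experiment concluded (hypothetical field)
    stage_reached: str   # highest funnel stage the capability demonstrated
    complete: bool       # designed, executed, evaluated, and written up

def experiments_per_week(log):
    """Complete experiments per week observed in the log."""
    weeks = {e.week for e in log}
    done = sum(e.complete for e in log)
    return done / len(weeks) if weeks else 0.0
```

Note that a failed experiment still counts as complete as long as it was evaluated and written up; only abandoned, never-evaluated work drags the number down.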
The Common Thread
None of these practices are technically exotic. They don't require the latest model, the most sophisticated architecture, or the largest evaluation infrastructure budget. What they require is a disciplined commitment to looking at actual behavior, making measurement easy, involving the people who know what "good" looks like, and treating experimentation as the primary unit of work.
The teams that ship AI products that get measurably better over time aren't necessarily smarter or better resourced. They've just replaced the instinct to add complexity with the habit of looking at data first.
That's unglamorous. It's also almost always what works.
