
Research Agent Design: Why Scientific Workflows Break Coding Agent Assumptions

10 min read
Tian Pan
Software Engineer

Most teams that build LLM-powered scientific tools make the same architectural mistake: they reach for a coding agent framework, swap in domain-specific tools, and call it a research agent. It isn't. Coding agents and research agents share surface-level mechanics — both call tools, both iterate — but their fundamental assumptions about success, state, and termination are almost perfectly inverted. Deploying a coding agent architecture in a scientific workflow doesn't just produce worse results; it produces confidently wrong results, and does so in ways that are nearly impossible to catch after the fact.

The distinction matters urgently now because research agent benchmarks are proliferating, teams are racing to build scientific AI, and the "just use a coding agent" shortcut is generating a wave of plausible-sounding tools that fail in production scientific contexts for reasons their builders don't fully understand.

The Convergence Problem

Coding agents are optimized for convergence. The canonical coding agent loop runs until the code passes tests, compiles cleanly, or produces output matching a spec. Success is a boolean: the repo either builds or it doesn't. The agent's job is to reduce the state space until it finds that fixed point.

Scientific research runs on the opposite principle. A hypothesis is not a program. It doesn't "pass" when the test suite goes green — it becomes more or less credible as evidence accumulates. The agent's job is not to converge but to explore, and to know when the evidence is strong enough to act on and when it demands another round of investigation.

This changes almost everything about how you structure the agent loop. A coding agent that terminates when it finds a working solution is correct by design. A research agent that terminates when it finds a plausible answer is dangerous by design — science is full of plausible answers that turned out to be wrong, and the entire point of the research process is to push past them.

Three practical consequences follow:

  • Termination criteria are open-ended. You cannot write a deterministic check for "is this hypothesis sufficiently supported." Termination requires confidence scoring, uncertainty quantification, and often a human-in-the-loop gate.
  • Contradictory evidence must be tracked, not resolved. When a coding agent finds two conflicting tool results, one is wrong and should be discarded. When a research agent finds two conflicting studies, both may be valid and the contradiction is itself informative.
  • Backtracking is not failure — it's the process. A coding agent that reverts a change is recovering from an error. A research agent that abandons a hypothesis and restarts with a different framing has done exactly what it should.

What the Orchestration Actually Looks Like

The most successful research agent architectures in production today share a recognizable pattern: a coarse outer loop that manages hypothesis state, and one or more inner loops that handle tool execution and evidence collection.

The outer loop maintains a hypothesis bank — a structured record of candidate explanations, their current confidence scores, and the experiments that have informed those scores. Each iteration either strengthens a hypothesis (adds supporting evidence), weakens it (adds contradicting evidence), spawns new candidates (evidence suggests a different framing), or terminates (confidence is high enough and coverage is sufficient).
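
As a concrete sketch of what that record can look like — all names and fields here are illustrative, not taken from any particular production system:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    claim: str
    provenance: str   # e.g. "DOI:10.xxxx/xxxx, section 3.2"
    supports: bool    # True = supporting, False = contradicting

@dataclass
class Hypothesis:
    statement: str
    confidence: float = 0.5
    evidence: list = field(default_factory=list)

class HypothesisBank:
    """Outer-loop state: candidate explanations plus the evidence behind them."""

    def __init__(self):
        self.hypotheses: list[Hypothesis] = []

    def add(self, statement: str) -> Hypothesis:
        h = Hypothesis(statement)
        self.hypotheses.append(h)
        return h

    def record(self, h: Hypothesis, ev: Evidence, delta: float) -> None:
        # Strengthen or weaken; contradicting evidence is kept, never discarded.
        h.evidence.append(ev)
        shift = delta if ev.supports else -delta
        h.confidence = min(1.0, max(0.0, h.confidence + shift))

    def leading(self) -> Hypothesis:
        return max(self.hypotheses, key=lambda h: h.confidence)
```

Note that `record` appends contradicting evidence rather than dropping it: the contradiction stays queryable even after it lowers the score.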

The inner loops are specialized agents working in parallel or sequence:

  • A literature agent that retrieves and synthesizes relevant papers. This is not simple RAG; the agent needs to understand citation context (is this paper supporting a claim, or is it the paper being contradicted?), filter for recency and relevance simultaneously, and distinguish between consensus findings and contested claims.
  • An experiment-design agent that translates hypotheses into testable protocols — either in silico simulations, database queries, or, in wet-lab settings, instructions for human operators.
  • An analysis agent that interprets results and updates hypothesis confidence scores, flagging when results are ambiguous rather than forcing a classification.
  • A critic agent whose sole job is to find reasons why the current leading hypothesis might be wrong. In production research systems, the critic role is often the most valuable: the other agents have implicit confirmation bias because they're advancing a specific hypothesis, and you need an adversarial signal to balance it.

FutureHouse's Robin system, which ran a complete end-to-end drug discovery workflow, exemplifies this structure: three specialized sub-agents (literature synthesis, molecule evaluation, data analysis) coordinated through an iterative loop that generates hypotheses, tests them, and refines them across multiple rounds. The key architectural choice is that the outer loop is explicit — no single agent decides whether the research is "done." That judgment lives at the orchestration level, not inside an individual agent's reasoning.
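
A minimal outer-loop skeleton consistent with this pattern — the agent interfaces, role names, and thresholds below are invented for illustration, not taken from Robin:

```python
def research_loop(bank, agents, max_rounds=10,
                  confidence_gate=0.85, min_evidence=5):
    """Outer loop: the 'are we done?' judgment lives here, not in any one agent."""
    for round_ in range(max_rounds):
        h = bank.leading()
        protocol = agents["designer"].design(h)    # hypothesis -> testable protocol
        result = agents["executor"].run(protocol)  # simulation, query, or wet-lab hand-off
        agents["analyst"].update(bank, h, result)  # evidence + confidence update
        objection = agents["critic"].challenge(h)  # adversarial signal
        if objection:
            agents["analyst"].update(bank, h, objection)
        # Terminate only when confidence AND coverage thresholds are both met.
        if h.confidence >= confidence_gate and len(h.evidence) >= min_evidence:
            return h
    return None  # no hypothesis cleared the gate: escalate to a human
```

The design choice to note: each agent returns data, and only the orchestrator compares that data against the termination gates.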

Citation Integrity as a First-Class Requirement

In coding agent design, the analog of a citation is a comment or docstring — useful for humans but not load-bearing for correctness. In research agent design, citations are load-bearing. They are the chain of provenance that lets a human (or a downstream agent) verify that a claim is grounded in real evidence rather than in the model's compressed training data.

This distinction is not theoretical. A 2025 audit of NeurIPS papers found over 100 hallucinated citations across 53 accepted papers — cases where AI-assisted writing produced citations with correct "vibes" (plausible authors, plausible titles, plausible venues) that did not correspond to real papers. GPTZero coined the term "vibe citing" for this failure mode. The citations felt right; they were not.

In a research agent that runs autonomously, the problem is structurally worse than in LLM-assisted writing. An LLM writing a paper with a human in the loop at least has the human checking citations. An autonomous research agent that passes hallucinated citations to a downstream analysis agent produces compounding errors: the analysis references the hallucinated paper as real evidence, subsequent hypothesis revisions treat that evidence as valid, and the final output carries conclusions grounded in nothing.

The fix is architectural, not just prompt-based:

  • Grounding checks must run at ingestion, not at output. Every paper the literature agent retrieves should be verified against a real database (Semantic Scholar, PubMed, arXiv) before its content enters the hypothesis bank.
  • Every claim in the hypothesis bank needs a provenance pointer — not "the literature suggests X," but "DOI:10.xxxx/xxxx, section 3.2, states X."
  • The analysis agent should flag unverifiable claims rather than silently incorporating them. An explicit uncertainty annotation ("this claim could not be verified against a primary source") is far more useful than a confidently stated claim that turns out to be hallucinated.
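
One way to enforce the ingestion-time gate is to put a verification function in front of the hypothesis bank, with the bibliographic lookup (a thin client for Semantic Scholar, PubMed, arXiv, or a local index) injected so it can be swapped or mocked. A sketch, with all names invented:

```python
def ingest_paper(citation: dict, lookup, bank: list) -> bool:
    """Admit a retrieved paper only if it resolves in a real bibliographic database.

    `lookup` maps a DOI to a record, or None if the DOI does not resolve;
    it is injected rather than hardcoded to any one database client.
    """
    doi = citation.get("doi")
    record = lookup(doi) if doi else None
    if record is None:
        # Unverifiable: flag it, and never let it enter the hypothesis bank.
        citation["status"] = "unverified"
        return False
    if record.get("title", "").lower() != citation.get("title", "").lower():
        # DOI resolves, but to a different paper: a classic "vibe citation".
        citation["status"] = "mismatch"
        return False
    citation["status"] = "verified"
    bank.append(citation)
    return True
```

Because the gate runs at ingestion, a hallucinated paper is stopped before any downstream agent can cite it as evidence.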

This is meaningfully different from coding agent design, where you might accept a tool output at face value because tool execution is deterministic. Literature retrieval through an LLM is not deterministic, and cannot be treated as if it were.

The "Correct Answer" Trap

The deepest failure mode in research agents is subtle: it looks like success.

Coding agents are evaluated against ground truth. You run the tests; they pass or fail. Research agents typically get evaluated against quality rubrics — does the report seem well-reasoned, does it cite relevant literature, does it identify key mechanisms? These rubrics are easy to satisfy superficially. A model that confidently synthesizes a plausible-sounding narrative with plausible-looking citations and a coherent argument structure can score well on rubrics while being factually wrong in consequential ways.

The PIES taxonomy of research agent hallucinations (Planning, Implicit, Explicit, Summarization errors) documents this in detail. Planning errors — where the agent pursues a coherent but incorrect hypothesis through all subsequent stages — are the hardest to catch because every downstream step is internally consistent. The agent didn't make an error in any individual tool call; it made an error in its initial framing and then executed that framing flawlessly.

The architectural mitigation is to separate hypothesis generation from hypothesis execution and insert adversarial validation between them. Don't let the same agent that generated a hypothesis also evaluate the evidence for it. This is structurally analogous to the separation between the author and reviewer roles in academic publishing — and for the same reason: the person most committed to a hypothesis is the least reliable evaluator of the evidence against it.

What Actually Works: Design Patterns for Research Agents

A few patterns that hold up across production research agent implementations:

Explicit state machines over unbounded loops. Rather than an "iterate until done" loop, model the research workflow as a finite state machine with explicit transitions: literature_review → hypothesis_generation → experiment_design → analysis → hypothesis_revision. This makes it possible to resume from checkpoints, audit which state a run was in when it failed, and enforce domain constraints at state transitions (e.g., "cannot enter experiment_design without at least N supporting papers").
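
A sketch of such a state machine, using the state names from the transition chain above — the transition table and guard values are illustrative:

```python
TRANSITIONS = {
    "literature_review":     {"hypothesis_generation"},
    "hypothesis_generation": {"experiment_design"},
    "experiment_design":     {"analysis"},
    "analysis":              {"hypothesis_revision"},
    "hypothesis_revision":   {"experiment_design", "literature_review", "done"},
}

class ResearchRun:
    def __init__(self, min_supporting_papers=3):
        self.state = "literature_review"
        self.papers = []
        self.min_supporting_papers = min_supporting_papers

    def advance(self, next_state: str) -> None:
        if next_state not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {next_state}")
        # Domain constraint enforced at the transition, not inside an agent.
        if (next_state == "experiment_design"
                and len(self.papers) < self.min_supporting_papers):
            raise ValueError("cannot enter experiment_design without enough papers")
        self.state = next_state
```

Because `state` is explicit, a crashed run can report exactly where it was, and a resumed run starts from a known checkpoint rather than replaying the whole loop.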

Entropy-aware planning. Early research phases should maximize exploration — high temperature, broad tool calls, many candidate hypotheses. Later phases should converge — lower temperature, targeted tool calls, head-to-head hypothesis comparison. A research agent that uses the same temperature and exploration width throughout will either underexplore (missing viable hypotheses early) or overexplore (unable to converge later). Build the phase transition into the orchestration logic, not the prompts.
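
The phase transition can live in a small schedule that the orchestrator consults each round — the specific temperatures, candidate counts, and cutoffs below are illustrative placeholders:

```python
def exploration_params(phase: str) -> dict:
    """Per-phase sampling settings: explore early, converge late."""
    schedule = {
        "explore":  {"temperature": 1.0, "n_candidates": 8, "tool_breadth": "broad"},
        "narrow":   {"temperature": 0.7, "n_candidates": 3, "tool_breadth": "targeted"},
        "converge": {"temperature": 0.2, "n_candidates": 1, "tool_breadth": "head_to_head"},
    }
    return schedule[phase]

def phase_for_round(round_index: int, total_rounds: int) -> str:
    """Coarse mapping from loop progress to research phase."""
    frac = round_index / max(total_rounds - 1, 1)
    if frac < 0.4:
        return "explore"
    if frac < 0.8:
        return "narrow"
    return "converge"
```

Keeping the schedule in orchestration code means the phase change happens even when an individual agent's prompt would happily keep exploring.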

Confidence scores as first-class data. Every hypothesis in the hypothesis bank should carry a confidence score that is updated by each piece of evidence. The score should be updated by a dedicated scoring agent that reasons explicitly about evidence strength — not inferred from the language the analysis agent happens to use. "This study provides strong support" is not a confidence score. 0.73 ± 0.12, with a pointer to the scoring criteria, is.
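
One way to make the score explicit — a log-odds update keyed to a fixed evidence-strength rubric. This is one scoring scheme among several, and the weights are invented for illustration:

```python
import math

STRENGTH_LOGODDS = {      # the explicit scoring criteria the score points back to
    "strong_support":  1.5,
    "weak_support":    0.5,
    "weak_contra":    -0.5,
    "strong_contra":  -1.5,
}

def update_confidence(confidence: float, evidence_strength: str) -> float:
    """Bayesian-style update in log-odds space; returns the new probability."""
    eps = 1e-6
    p = min(max(confidence, eps), 1 - eps)  # keep logit finite
    logit = math.log(p / (1 - p)) + STRENGTH_LOGODDS[evidence_strength]
    return 1 / (1 + math.exp(-logit))
```

The rubric keys are the "pointer to the scoring criteria": a downstream reader can see exactly which weight each piece of evidence contributed, instead of reverse-engineering it from prose.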

Dead-letter queues for inconclusive evidence. When an experiment produces ambiguous results, the research agent shouldn't force a classification. Route inconclusive evidence to a holding state, flag it for human review, and track it as an open question rather than collapsing it into a binary. Inconclusive evidence is signal too — a hypothesis that consistently generates ambiguous experiments is a weaker hypothesis than one that generates clean refutations.
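
A sketch of that routing step — the thresholds and the `support_score` field (an analyst agent's graded read of the result, in [0, 1]) are assumptions for illustration:

```python
def route_result(result: dict, hypothesis: dict,
                 support_threshold: float = 0.7,
                 refute_threshold: float = 0.3) -> str:
    """Route an experiment result; ambiguous outcomes go to a dead-letter queue."""
    score = result["support_score"]
    if score >= support_threshold:
        hypothesis.setdefault("supporting", []).append(result)
        return "supported"
    if score <= refute_threshold:
        hypothesis.setdefault("refuting", []).append(result)
        return "refuted"
    # Neither: hold for human review instead of forcing a binary call.
    hypothesis.setdefault("dead_letter", []).append(result)
    hypothesis["open_questions"] = len(hypothesis["dead_letter"])
    return "inconclusive"
```

The `open_questions` count is itself signal for the outer loop: a hypothesis whose dead-letter queue keeps growing is accumulating ambiguity, not support.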

Evaluation Requires Domain Ground Truth

A final architectural difference: research agents require domain-specific evaluation infrastructure that coding agents don't need.

Coding agents can be evaluated with generic benchmarks (SWE-bench, HumanEval) because the correctness criterion — does the code work? — is domain-agnostic. Research agents cannot. A research agent producing drug-target hypotheses needs to be evaluated against known drug-target relationships in pharmacology databases. A research agent producing materials science hypotheses needs evaluation against known material properties. The benchmark itself requires domain expertise to construct, and the evaluation pipeline requires database integrations that are specific to the research domain.
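
As a sketch of what that domain-specific evaluation reduces to once the database integration exists — the known-pairs set would come from a curated pharmacology resource; the data here is invented:

```python
def evaluate_hypotheses(predicted_pairs, known_pairs):
    """Score predicted drug-target pairs against a curated ground-truth set."""
    predicted, known = set(predicted_pairs), set(known_pairs)
    tp = len(predicted & known)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(known) if known else 0.0
    return {"precision": precision, "recall": recall, "true_positives": tp}
```

The code is trivial; the hard, domain-expert work is constructing `known_pairs` — which is exactly the part that cannot be borrowed from a generic coding benchmark.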

This is often the piece that teams skip when they first deploy research agents. They treat the benchmark problem as a secondary concern — we'll figure out evaluation once the agent is working. But without a ground-truth benchmark, you cannot tell whether the agent is getting better or just getting more confident. The two are not the same, and in scientific research, mistaking one for the other is the error you most need to avoid.

Building a research agent is not a prompt-engineering problem layered on top of a coding agent architecture. It requires different state management, different evaluation infrastructure, different failure detection, and a fundamentally different model of what "done" means. The teams building the most effective scientific agents right now treat those differences as first-order design constraints, not afterthoughts.
