The Retrieval Bottleneck: A Controlled Comparison of 13 Deep-Research Architectures

Deep-research agents are easy to admire and almost impossible to compare.

One system uses a supervisor and worker fan-out. Another decomposes the query into a graph. A third expands perspectives in the style of STORM; a fourth trains its own search agent with reinforcement learning; a fifth just asks a stronger model to write the report after a single retrieval pass. Then the papers arrive with confident architecture claims — and the model, the tool stack, the query set, the retry policy, and the citation layer have all changed at the same time. When five variables move together, no result can be attributed to any one of them.

This benchmark fixes that. It holds the model family, the tool layer, the query manifest, and the judging protocol as constant as a research budget allows, and varies only the orchestration architecture. The question is deliberately narrow: once a deep-research system has enough retrieval and synthesis structure to be competent, how much does the architecture actually matter?

The short answer surprised me. Six different architectures finished in a statistical dead heat for the top. The base model mattered about twice as much as any of them. And the highest-scoring system of all was citing 62.9% fake citations that the judges happily rewarded.

Controlled benchmark at a glance

What the deep-research study measured

deep-research architectures compared

queries in the shared manifest

frontier judges in the panel

rubric dimensions scored

190,222

single-judge criterion verdicts

61,318

consensus-eligible triples

348

Bing-vs-Tavily intervention runs

42,518

URL cache entries for C0 verification

Architecture

Six GPT-4o pipelines formed a top cluster

mean 0.601-0.6739/15 pairs TOST-equivalent at ±0.05 ROPE0/15 at ±0.02

Once a system has competent retrieval and synthesis structure, more elaborate orchestration stops buying reliable gains. The cluster is power-justified equivalence, not a single winner.

Model scale

Model scale dominated the architecture effects

GPT-4o vs local 7B gap ≈0.40≈2x baseline-to-top range≈5x intra-orchestration range

Architecture had real, measurable effects — but base-model capability mattered several times more in this setup.

Judge distillation

DR-Judge worked for the easy cases only

overall κ 0.453undisputed κ 0.652disputed κ 0.199

The small judge missed its pre-registered κ ≥ 0.7 target. It is a triage layer for undisputed verdicts, not a panel replacement.

RL training

Reward design mattered more than RL itself

P12 GRPO 0.182 ≈ P9 base 0.180P10 outcome-RL beat base by Δ0.078

Rubric-judge-supervised GRPO did not lift the 7B policy; outcome-supervised RL (DeepResearcher-7B) did. A bounded result about reward design, not an anti-RL claim.

Limits

The retrieval bottleneck is inferred, not demonstrated — the tool layer was held constant, so no intervention isolates retrieval. Report quality, citation provenance, and factuality are all scored by LLMs, with no human reference-rating pilot. Substantive-dimension judge agreement is weak (Krippendorff α 0.105-0.135). The 90-query manifest is English-language only.

Experiment design

The whole benchmark in one control loop

The benchmark fixes the model family, tool layer, query manifest, and judging protocol, then varies only the orchestration architecture — so a quality difference can be attributed to one variable instead of five moving at once.

Fixed inputs

Hold the comparison still

The benchmark fixes the query manifest, tool layer, retry behavior, and scoring protocol so architecture is not being compared through a fog of unrelated changes.

90 queriesShared retrieval stackSame rubric

Varied systems

Change the research architecture

The systems under test span single-pass retrieval, iterative RAG, supervisor fan-out, STORM-style perspectives, graph decomposition, beam search, local 7B baselines, and RL agents.

13 architecturesGPT-4o pipelinesLocal 7B additions

Judged outputs

Score with a panel, not a vibe

A three-frontier-judge panel scores reports across nine rubric dimensions, then the analysis checks agreement, equivalence, run variance, and judge failure modes.

3 judges9 dimensionslarge verdict set

Audited claims

Inspect the source layer

The work audits citation provenance, tests a C0 verification cache, reports DR-Judge limits, and keeps the P12 GRPO result negative rather than smoothing it away.

citation cacheconsensus checksbounded negatives

The setup, briefly

The study compares 13 architectures across three families, all running against the same 90-query manifest and the same tool layer (web search, the Semantic Scholar and arXiv APIs, page extraction, a shared retry policy). Nine are GPT-4o pipelines spanning the usual design space — a single-pass baseline, iterative RAG, a supervisor fan-out, a topic-mining loop, STORM-style perspective expansion, hierarchical width-and-depth, reactive interleaving, graph decomposition, and beam search over outlines. Two are local 7B systems: a plain baseline and one RL-trained end-to-end for research. Two more were added afterward: a canonical ReAct loop, and a 7B agent I trained in-house.

Every report is scored by a panel of three frontier judges from deliberately different model families, across a nine-dimension rubric with 38 binary criteria. Using three families is what lets the study isolate one variable at a time: baseline-versus-the-rest isolates orchestration, the 7B-versus-GPT-4o contrast isolates model scale, and trained-versus-untrained isolates RL.

Before comparing a single architecture, the panel itself had to be validated — a benchmark scored by unreliable judges is just an opinion with error bars. The judges agreed well on style-shaped dimensions like organisation and badly on substance-shaped ones like factual accuracy and attribution; on attribution they agreed worse than chance. That split is not noise to average away. It means the panel can certify relative rankings and cluster equivalence, but not exact dimension-level scores to three decimals on the substantive dimensions. Every number below is read in that light. (The agreement statistics are in the methods section.)

A top cluster, not a winner

The headline architecture result is less dramatic than a leaderboard, and more useful.

Six GPT-4o pipelines formed a statistically indistinguishable top cluster — iterative RAG, STORM, hierarchical width-and-depth, reactive interleaving, graph decomposition, and beam search. Their mean scores sat in a band narrow enough that ordering them would be reading noise as signal. To make "indistinguishable" precise, the study used equivalence testing rather than ordinary significance testing: it asked "are these two architectures equivalent within a margin declared in advance?", a question that can return a real positive answer, and most of the within-cluster pairs came back formally equivalent.

Pareto frontier of overall score versus cost proxy across deep-research architectures

Here is what that buys you in practice. Orchestration was not cosmetic — the single-pass baseline never reached the top cluster, so some structure pays off. But once a system had enough retrieval and synthesis scaffolding to be competent, more elaborate orchestration stopped buying reliable gains. Plotting score against compute cost, the top cluster occupies a roughly 4× cost band with no clear quality ordering inside it. You can pay four times as much for orchestration and land in the same place.

Model scale hit harder than architecture

If orchestration produced a soft, clustered result, model scale produced a blunt one. The gap between the GPT-4o pipelines and the local 7B systems was about twice the entire architecture span measured across the GPT-4o family — and by some anchorings closer to five times. The exact multiplier depends on which range you compare against, but the direction is not in doubt: in this setup, base-model capability dominated every architecture effect observed.

That does not make architecture irrelevant. The largest single component effect measured anywhere in the study was STORM's triangulation stage — removing it caused a robustly significant loss. Architecture has real, measurable effects; they just have to be reported as effect sizes, not leaderboard positions. And the failure modes differed by family in a telling way: GPT-4o systems failed by producing fluent, citation-shaped reports that still needed provenance checks, while 7B systems failed visibly, by producing nothing gradeable at all.

The citation problem was worse than the score showed

The most uncomfortable finding was not about any architecture. It was about citations, and about what the judges were actually rewarding.

A provenance audit classified nearly 23,000 citations across the reports into placeholder, real-URL, academic, and malformed buckets. Several top-cluster patterns turned out to be citing mostly placeholders — text-only "Web Search Synthesis" strings rather than concrete URLs. STORM, the highest-scoring pattern overall, produced 62.9% placeholder citations. Three others sat above 55%. And the judges' citation-quality scores tracked citation density, not academic-source rate: a report with many bracketed references read as rigorous to the panel whether or not those references pointed anywhere. The judges were partly rewarding the appearance of grounding.

Citation provenance audit showing placeholder, URL, academic, and suspicious citation rates by pattern

So the study ran a stricter test: decompose each report into atomic factual claims, resolve each inline citation against a URL cache, and ask whether the cited source directly supports the claim. Under that strict rubric, only 27% of claims were directly supported, just 0.5% were contradicted, and 69.5% landed in a neutral bucket — on-topic, but the source does not directly state the claim.

Strict citation-grounded verification summary showing neutral-heavy entailment results

That 69.5% is easy to misread. The near-zero contradiction rate is reassuring: these reports rarely emit claims their own sources flatly refute. But the neutral mass is largely a measurement artefact, not a quality verdict. Deep-research reports synthesise — they combine and infer across multiple sources — and a rubric borrowed from short-form factuality checking, where atomic facts appear verbatim in one source, cannot score that. Re-running a sample under a rubric that admits multi-source triangulation lifted the "supports" rate substantially. Verification is not impossible here; it just needs a rubric built for synthesis.

The backend swap, the small judge, and the RL run

Three further experiments are worth keeping mostly because they refused to give clean answers.

The backend swap made the story messier, not cleaner. A simple theory would be: replace the default web search with a backend that returns clean real URLs, and scores rise. Under one judge, the cleaner backend depressed scores across every pattern — because it returned far fewer citations, and some orchestration loops kept emitting synthesis text proportional to perspective count even when the retrieved evidence was thin. Under a different judge family, the effect mostly reversed. The naive "cleaner URLs, better scores" hypothesis was inverted by pattern-internal logic and by which judge you asked.

Bing-versus-Tavily backend protocol: per-pattern deltas and judge-specific caveats

A small judge worked for the easy cases only. I fine-tuned a 7B model on panel-consensus verdicts to see whether a cheap local model could approximate the frontier panel. On verdicts where all three panel judges already agreed, it reached near-substantial agreement; on verdicts where the panel split, it collapsed. That makes it a useful triage layer, not a panel replacement — and it is a methodological point too: a model trained on majority labels learns the easy boundary and fails exactly where judgement is most valuable, because the tie-breaking logic is not in the labels.

Training a research agent with RL, and watching it not improve. I RL-trained a 7B policy using that small judge as the reward model. The run completed, the reward signal had a gradient to follow — and the policy did not improve, scoring statistically tied with the untrained base. The bound matters: this is not an anti-RL claim. The study's own counter-evidence is a different 7B agent, trained with outcome supervision rather than rubric-judge supervision, that did beat its base. The reward-design axis mattered as much as the RL itself. A noisy judge does not make a useful reward model.

Process quality is not outcome quality

A trajectory audit scored four process dimensions against final outcome. Only one — whether the system revised its conclusions based on what it found — strongly predicted quality. Retrieval diversity was slightly anti-correlated with it.

That sounds strange until it connects back to the citation audit. More retrieval activity is not automatically better: more citation-shaped artefacts can mean more real evidence, or more surface area for impressive-looking placeholders. For agentic systems the design lesson is sharp — counting tool calls is not measuring research, and a trace is only worth instrumenting if it exposes whether the system actually changed its mind.

For the methods-minded

Everything above is the readable layer. The exact agreement coefficients, equivalence margins, entailment buckets, and effect sizes are below — skip them if you only wanted the conclusions.

The statistics, in full

Panel reliability was ICC(A,k=3) = 0.742 on the panel-averaged overall score — conventional "good" agreement — but Krippendorff's α (chosen because it handles missing verdicts and arbitrary scales) ranged from 0.840 on organisation down to −0.102 on attribution quality, with factual accuracy at 0.135 and information recall at 0.105. A negative α is agreement worse than chance; some rubric dimensions simply do not have a single answer a panel can converge on.

The six-pattern top cluster ran mean overall scores of 0.601–0.673. Under paired-Wilcoxon TOST at a ±0.05 region of practical equivalence, 9 of 15 pairs were formally equivalent; at a stricter ±0.02 ROPE, 0 of 15 were. The ±0.05 margin was matched to the design's minimum detectable effect (MDE₈₀ = 0.040), so the architectures are equivalent at the margin the data can resolve, no tighter. A run-variance probe put pooled run-to-run noise at σ(run) = 0.050 — the same size as the equivalence margin — which is why single-run pattern means should not be read to three decimals; the cluster claim rests on the full 90-query, three-judge design where paired averaging cancels much of that noise.

Critical-difference chart showing the statistically indistinguishable top cluster

Run-to-run variance across selected architectures

Model scale: the GPT-4o-versus-7B gap was about 0.40 in overall score. Against the baseline-to-top range within GPT-4o (0.185) that is ~2×; against the intra-orchestration range with the single-pass baseline excluded (0.072) it is ~5×. The single-pass baseline scored 0.488. STORM's triangulation ablation cost −0.060. A failure taxonomy over 49,442 unsatisfied criterion verdicts found citation fabrication 8× more common in GPT-4o pipelines (3.5% vs 0.4%) and empty-or-sparse output 7× more common in 7B systems (5.8% vs 0.8%).

Per-pattern, per-dimension heatmap of where architectures improved and hit ceilings

Process quality versus compute cost across architectures

Citations: the provenance audit covered 22,903 citations; STORM 62.9% placeholder, topic-mining 61.4%, beam search 60.4%, hierarchical 55.3%, while iterative RAG, reactive interleaving, graph decomposition, and the baselines produced almost none. Judge citation-quality scores correlated with density (r = +0.28) but not academic-source rate (r = +0.01, n.s.); placeholder rate carried a small positive coefficient on factual-accuracy score (β = +0.026) even controlling for pattern, count, and length. The strict entailment pass: 26.8% supported, 0.5% contradicted, 69.5% neutral; a 50-claim re-run under a triangulation-aware rubric lifted "supports" from 16% to 40%.

Effect of filtering placeholder citations before rerank

The small judge reached Cohen's κ = 0.453 overall against the panel — 0.652 on undisputed verdicts, 0.199 on disputed — missing its pre-registered κ ≥ 0.7 target. The RL agent scored 0.182 versus 0.180 for the untrained base (tied); the outcome-supervised counter-example beat its base by Δ = 0.078. The process audit found iterative refinement predicted outcome at Spearman ρ = +0.90, retrieval diversity at ρ = −0.10.

DR-Judge agreement with the three-judge panel across patterns and dimensions

GRPO reward curve and panel-side outcome for the in-house RL agent

Process-trajectory audit linking retrieval diversity, tool efficiency, reasoning coherence, and iterative refinement to outcome

Limitations

The retrieval bottleneck is inferred, not demonstrated. The tool layer was held constant across all patterns, so no intervention isolates retrieval as the binding constraint. An oracle-retrieval ablation would be needed to prove it; the thesis stays at "consistent with."
Single tool layer, single query distribution — all English-language, Western-slanted. The results do not speak to multilingual, domain-restricted, or code-execution research.
LLM-as-judge throughout. No human reference-rating pilot was run; within-judge re-test reliability and cross-version triangulation partly substitute, but neither establishes human ground truth.
Substantive-dimension agreement is weak, so the panel certifies relative rankings and cluster equivalence, not exact substantive scores.
The post-hoc patterns have thinner coverage — scored on one judge only, and the process audit instrumented just the patterns that emitted execution traces.

What to check before you trust your own agent

Strip away the machinery and the result is plain: once a deep-research system is competent, the architecture you pick matters less than you expect, the base model matters more, and the citation layer matters in a way the quality score never showed. Six pipelines tied for the top, and the highest scorer among them was citing 62.9% placeholders that the judges rewarded as rigour.

That last number is the one to act on. Take the deep-research agent you already run and audit its citations the way this study did — classify each into placeholder text, real URL, and academic source, then check what fraction of its claims a cited page actually supports. If your judge or your dashboard is rewarding bracket density rather than grounding, you will find it in an afternoon, and you will trust your own quality scores a little less and your provenance check a lot more. The architecture is the part everyone argues about. What the system cites is the part that turned out to matter.