HiEviDR-Bench: A Benchmark for Hierarchical Evidence Aggregation in Deep Research

Yubo Sun1, Chunyi Peng2, Yukun Yan3, Zhenghao Liu2, Sen Mei2, Bangrui Xu4, Xuanhe Zhou4 Chi Chen3 Maosong Sun3
1Peking University, 2Northeastern University, 3Tsinghua University, 4Shanghai Jiao Tong University
Interpolation end reference image.

Stage-wise score of HiEviDR-Bench, ranked by overall score(0-100).

Abstract

Deep research requires models to retrieve, connect, and synthesize evidence from large-scale heterogeneous sources to answer complex queries and produce analytical reports. Existing benchmarks mainly evaluate final outcomes, such as answer correctness, report quality, or citation alignment, while providing limited visibility into whether evidence is correctly selected, linked, and aggregated into supported claims and conclusions.

To address this gap, we introduce HiEviDR-Bench, a benchmark for evaluating Hierarchical Evidence aggregation in Deep Research. HiEviDR-Bench covers open-domain and academic-domain settings under both text-only and multimodal conditions, and represents each instance with an explicit evidence graph that captures evidence selection, cross-source linking, and aggregation from evidence to intermediate claims and final conclusions. Based on this formulation, we develop a traceability-oriented evaluation framework with five dimensions—report quality, evidence traceability, citation accuracy, claim verification, and answer correctness—together with a progressive gating mechanism for fine-grained error localization.

HiEviDR-Bench contains 3,407 questions with evidence graphs across multiple difficulty levels. Experiments on 12 representative multimodal large language models show that, although many systems achieve strong report quality, their performance drops markedly on citation accuracy, claim construction, and answer correctness. Further analysis shows that the main bottlenecks lie in evidence identification and intermediate claim construction, revealing that the core limitation of current deep research systems lies in evidence composition and claim-level reasoning rather than report fluency alone.

Evaluation

Each question is paired with heterogeneous evidence items, an evidence graph, and a standard answer, enabling five-dimensional evaluation of a generated report: Report, Traceability, Citation, Claim, and Answer. A progressive gating mechanism is applied to the last three stages to ensure faithful evidence aggregation and grounded answer generation.

Interpolation end reference image.

Construction

We first build a multimodal corpus from two source domains, then select anchor-centered candidate evidence items to construct an evidence graph with an LLM. Based on the graph, we generate question--answer pairs and apply rule-based filtering, Judge-Model checking, and human spot checks to ensure structural validity and data quality.

Interpolation end reference image.

Visualization Case Study

This case study provides an interactive view of HiEviDR-Bench. On the left, the evidence graph explicitly organizes the reasoning process from evidence items, to intermediate claims, and finally to the conclusion. By clicking different nodes, users can inspect how each piece of evidence contributes to specific claims and how these claims are aggregated into the final answer. On the right, we show a corresponding multimodal deep research report, highlighting that HiEviDR-Bench supports both structured traceability analysis and grounded long-form report generation.

Evidence Graph

Q A C1 C2 C3 E1 E2 E3
Conclusion
Claim
Evidence

Node Information

Click any node on the left to display its corresponding information here.