*Wei Liu* :kcl_inf:, Peijie Yu :tencent-logo-symbol-vector:, Michele Orini :kcl_bio:, Yali Du :kcl_inf:, Yulan He ✉️ :kcl_inf:
Work in progress; arXiv/code/dataset coming soon~
<aside>
💡 TL;DR
The emergence of agentic large language models represents a fundamental transition in AI, from executional intelligence that follows instructions to investigatory intelligence that autonomously explores, reasons, and plans. Current agents, though capable of autonomy, still operate under goal-conditioned settings, pursuing predefined objectives rather than generating targets on their own. True proactivity requires the ability to decide what to investigate before determining how to do so, a capacity that remains largely untested.
If an agent is to autonomously determine its own goals, it must be able to observe its environment and set goals based on those observations. Data science tasks provide an excellent testbed for evaluating the proactivity of LLM agents: for an LLM, all environmental feedback and observations can ultimately be represented as some form of data. In this sense, wherever data exists, an agent should be capable of autonomous analysis rather than being driven solely by externally posed questions. However, few benchmarks currently exist for such data science agents, even though achieving agents with the highest level of autonomy remains one of the most important north-star objectives today.
To bridge this gap, we propose the task of Deep Data Research, designed to evaluate whether models can autonomously set investigative goals and extract meaningful insights from complex databases without manually specified questions or goals. Think of Deep Research, where agentic LLMs search the Internet to carry out open-ended research. In Deep Data Research, agentic LLMs instead perform “Deep Research” on structured databases. This requires more than web queries: the LLM can write code and execute SQL to perform far more complex searches and reasoning over the database.
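To make this concrete, the sketch below runs an exploratory aggregate query over a hypothetical toy schema (the `orders` table, its columns, and all values are invented for illustration; DDR-Bench databases are far larger). A per-user spend breakdown like this is exactly the kind of operation a web search cannot perform on private structured data:

```python
import sqlite3

# Hypothetical toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (userid INTEGER, amount REAL, created_at TEXT);
INSERT INTO orders VALUES (2048, 19.9, '2024-01-03'),
                          (2048, 250.0, '2024-01-15'),
                          (17,   5.0,  '2024-01-04');
""")

# An exploratory query an agent might issue: per-user order counts
# and total spend, ranked by total.
rows = conn.execute("""
SELECT userid, COUNT(*) AS n_orders, ROUND(SUM(amount), 2) AS total
FROM orders GROUP BY userid ORDER BY total DESC
""").fetchall()
print(rows)  # user 2048: 2 orders totalling 269.9; user 17: 1 order of 5.0
```

From a result like this, the agent might hypothesize that user 2048 is a high-value customer and drill further into their order history.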
This task reflects how expert data scientists work in practice, continually hypothesizing, probing, and interpreting data to uncover patterns and relationships that were not explicitly sought. Building on this framework, we introduce DDR-Bench, a large-scale benchmark that measures proactive exploration in data science through verifiable, sample-wise checklist evaluation, providing controlled yet open-ended settings for analysis.
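As an illustration of sample-wise checklist scoring (not the actual DDR-Bench verifier; the keyword-containment matching rule and the checklist items below are assumptions for this sketch), one could compute the fraction of expected findings covered by an agent's insights:

```python
# Illustrative checklist scorer; a real verifier would use a stronger
# matching criterion than keyword containment.

def checklist_score(insights, checklist):
    """Return the fraction of checklist items supported by >=1 insight."""
    hits = 0
    for item in checklist:
        if any(item.lower() in insight.lower() for insight in insights):
            hits += 1
    return hits / len(checklist)

insights = [
    "User 2048's spending spiked in January.",
    "Most of user 2048's orders are small purchases.",
]
checklist = ["spending spiked in january", "small purchases", "churn risk"]
print(checklist_score(insights, checklist))  # 2 of 3 items matched
```

Scoring each sample against its own checklist keeps the evaluation verifiable even though the exploration itself is open-ended.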
Given a database $D$, an LLM must use a tool set $T$ to query the database iteratively for up to $N$ rounds, stopping only once it deems it has collected sufficient information. The LLM has no explicit question to answer and no predefined objective; it receives only a basic start prompt specifying the target entity, for example, “Start analyzing the user with userid=2048.”
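The interaction protocol can be sketched as a simple control loop. Here `call_llm` and `execute_tool` are hypothetical stubs standing in for the model and the tool executor, and the round budget is an assumed value:

```python
# Minimal control-loop sketch of the Deep Data Research setting.

N_MAX_ROUNDS = 20  # round budget N (assumed value)

def call_llm(history):
    """Stub LLM: returns (reasoning, tool_call), or (reasoning, None) to stop."""
    if len(history) >= 2:          # toy stopping rule for this sketch
        return "Enough evidence collected.", None
    return "Inspect the orders table.", "SELECT COUNT(*) FROM orders"

def execute_tool(tool_call):
    """Stub executor: would run SQL/code against database D."""
    return f"result of: {tool_call}"

history = ["Start analyzing the user with userid=2048."]  # start prompt only
for _ in range(N_MAX_ROUNDS):
    reasoning, tool_call = call_llm(history)
    if tool_call is None:          # the agent itself decides when to stop
        break
    observation = execute_tool(tool_call)
    history.extend([reasoning, tool_call, observation])
```

Each iteration appends one reasoning/tool/observation triple to the transcript, and the loop ends either when the agent stops voluntarily or when the round budget runs out.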
In the first round, the LLM receives basic information about the database: the available tables and a brief description of each table. In each subsequent round, the LLM observes all prior results and then outputs reasoning tokens $r$ and tool invocation tokens $t$, after which the tool executes on the database and produces results $o$. Through this ReAct-style interaction $(r, t, o)$ over multiple rounds, the LLM autonomously decides when to stop and produces two types of insights: