*Wei Liu* :kcl_inf:, Peijie Yu :tencent-logo-symbol-vector:, Michele Orini :kcl_bio:, Yali Du :kcl_inf:, Yulan He ✉️ :kcl_inf:
Work in progress; arXiv/code/dataset coming soon~
<aside>
💡 TL;DR
The emergence of agentic large language models represents a fundamental transition in AI, from executional intelligence that follows instructions to investigatory intelligence that autonomously explores, reasons, and plans. Current agents, though capable of autonomy, still operate under goal-conditioned settings, pursuing predefined objectives rather than generating targets on their own. True proactivity requires the ability to decide what to investigate before determining how to do so, a capacity that remains largely untested.
If an agent is to autonomously determine its own goals, it must be able to observe its environment and set goals based on those observations. Data science tasks provide an excellent testbed for evaluating the proactivity of LLM agents: for an LLM, all environmental feedback and observations can ultimately be represented as some form of data. In this sense, wherever data exists, an agent should be capable of autonomous analysis rather than being driven solely by externally posed questions. However, few benchmarks currently exist for such data science agents, even though achieving agents with the highest level of autonomy remains one of the most important north-star objectives today.
To bridge this gap, we propose the task of Deep Data Research, designed to evaluate whether models can autonomously set investigative goals and extract meaningful insights from complex databases without manually specified questions or goals. Think of Deep Research, where agentic LLMs search the Internet to carry out open-ended research. In Deep Data Research, agentic LLMs instead perform “Deep Research” on structured databases. This requires more than web queries: the LLM can write code and execute SQL to perform far more complex searches and reasoning over the database.
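To make this concrete, the sketch below runs an exploratory aggregate query over a hypothetical toy schema (the `orders` table, its columns, and all values are invented for illustration; DDR-Bench databases are far larger). A per-user spend breakdown like this is exactly the kind of operation a web search cannot perform on private structured data:

```python
import sqlite3

# Hypothetical toy database for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (userid INTEGER, amount REAL, created_at TEXT);
INSERT INTO orders VALUES (2048, 19.9, '2024-01-03'),
                          (2048, 250.0, '2024-01-15'),
                          (17,   5.0,  '2024-01-04');
""")

# An exploratory query an agent might issue: per-user order counts
# and total spend, ranked by total.
rows = conn.execute("""
SELECT userid, COUNT(*) AS n_orders, ROUND(SUM(amount), 2) AS total
FROM orders GROUP BY userid ORDER BY total DESC
""").fetchall()
print(rows)  # user 2048: 2 orders totalling 269.9; user 17: 1 order of 5.0
```

From a result like this, the agent might hypothesize that user 2048 is a high-value customer and drill further into their order history.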
This task reflects how expert data scientists work in practice, continually hypothesizing, probing, and interpreting data to uncover patterns and relationships that were not explicitly sought. Building on this framework, we introduce DDR-Bench, a large-scale benchmark that measures proactive exploration in data science through verifiable, sample-wise checklist evaluation, providing controlled yet open-ended settings for analysis.
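As an illustration of sample-wise checklist scoring (not the actual DDR-Bench verifier; the keyword-containment matching rule and the checklist items below are assumptions for this sketch), one could compute the fraction of expected findings covered by an agent's insights:

```python
# Illustrative checklist scorer; a real verifier would use a stronger
# matching criterion than keyword containment.

def checklist_score(insights, checklist):
    """Return the fraction of checklist items supported by >=1 insight."""
    hits = 0
    for item in checklist:
        if any(item.lower() in insight.lower() for insight in insights):
            hits += 1
    return hits / len(checklist)

insights = [
    "User 2048's spending spiked in January.",
    "Most of user 2048's orders are small purchases.",
]
checklist = ["spending spiked in january", "small purchases", "churn risk"]
print(checklist_score(insights, checklist))  # 2 of 3 items matched
```

Scoring each sample against its own checklist keeps the evaluation verifiable even though the exploration itself is open-ended.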
Given a database $D$, an LLM must use a tool set $T$ to query the database iteratively for up to $N$ rounds, stopping only once it deems it has collected sufficient information. The LLM has no explicit question to answer and no predefined objective; it receives only a basic start prompt specifying the target entity, for example, “Start analyzing the user with userid=2048.”
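The interaction protocol can be sketched as a simple control loop. Here `call_llm` and `execute_tool` are hypothetical stubs standing in for the model and the tool executor, and the round budget is an assumed value:

```python
# Minimal control-loop sketch of the Deep Data Research setting.

N_MAX_ROUNDS = 20  # round budget N (assumed value)

def call_llm(history):
    """Stub LLM: returns (reasoning, tool_call), or (reasoning, None) to stop."""
    if len(history) >= 2:          # toy stopping rule for this sketch
        return "Enough evidence collected.", None
    return "Inspect the orders table.", "SELECT COUNT(*) FROM orders"

def execute_tool(tool_call):
    """Stub executor: would run SQL/code against database D."""
    return f"result of: {tool_call}"

history = ["Start analyzing the user with userid=2048."]  # start prompt only
for _ in range(N_MAX_ROUNDS):
    reasoning, tool_call = call_llm(history)
    if tool_call is None:          # the agent itself decides when to stop
        break
    observation = execute_tool(tool_call)
    history.extend([reasoning, tool_call, observation])
```

Each iteration appends one reasoning/tool/observation triple to the transcript, and the loop ends either when the agent stops voluntarily or when the round budget runs out.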
In the first round, the LLM receives basic information about the database: the available tables and a brief description of each table. In each subsequent round, the LLM observes all prior results and then outputs reasoning tokens $r$ and tool invocation tokens $t$, after which the tool executes on the database and produces results $o$. Through this ReAct-style interaction $(r, t, o)$ over multiple rounds, the LLM autonomously decides when to stop and produces two types of insights: