HKUSTDial/awesome-data-agents

Awesome Data Agents

Curated papers and resources on Data Agents, organized by the proposed autonomy levels. Companion list for our forthcoming survey.

If you find this repository useful or inspiring, please give us a star ⭐️ and cite our survey:

Coming soon

Introduction

[Figure: Teaser]

The way humans interact with data is undergoing a profound transformation. Data agents, LLM-powered systems designed to orchestrate the Data + AI ecosystem, are emerging as a promising solution for automating and democratizing data-related tasks across the data lifecycle, from management and preparation to analysis.

However, the term "data agent" is currently used inconsistently across research and industry, resulting in considerable ambiguity. Systems that differ vastly in autonomy, reliability, and task complexity are often given the same label. This creates a "Tower of Babel" crisis in which mismatched expectations and unclear accountability threaten to undermine user trust and impede the healthy development of the field.

This repository — a companion to our survey — introduces a layered taxonomy (L0-L5) for data agents based on their degree of autonomy, providing a common framework to clarify capability boundaries and lines of accountability at each level.

Levels of Data Agents

As mentioned above, to bring clarity to the diverse landscape of data agents, we propose a layered taxonomy based on their degree of autonomy. This framework maps the progressive shift of responsibility from human to agent, defining the distinct roles each plays at every stage, as summarized in the overview figure and the table below.

| Level | Degree of Autonomy | Human Role | Data Agent Role |
| --- | --- | --- | --- |
| L0 | Manual/No Autonomy | Dominator (Solo) | N/A (None) |
| L1 | Assisted | Dominator (Editing) | Assistant (Auxiliary) |
| L2 | Partial Autonomy | Dominator (Orchestrating) | Executor (Procedural) |
| L3 | Conditional Autonomy | Supervisor (Overseeing) | Dominator (Autonomous) |
| L4 | High Autonomy | Onlooker (Delegating) | Dominator (Proactive) |
| L5 | Full Autonomy | N/A (None) | Dominator (Generative) |
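
The table above can also be encoded as a small lookup structure, which may be handy when tagging papers or systems by level. This is a minimal sketch only: the names `AutonomyLevel`, `LevelProfile`, `TAXONOMY`, and `dominant_party` are illustrative assumptions and are not defined by the survey or this repository.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class AutonomyLevel(IntEnum):
    """Levels L0-L5 from the taxonomy table (class name is illustrative)."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


@dataclass(frozen=True)
class LevelProfile:
    """One row of the table; None stands for the table's 'N/A (None)' cells."""
    autonomy: str
    human_role: Optional[str]
    agent_role: Optional[str]


# Values copied from the table; the dictionary itself is a hypothetical helper.
TAXONOMY = {
    AutonomyLevel.L0: LevelProfile("Manual/No Autonomy", "Dominator (Solo)", None),
    AutonomyLevel.L1: LevelProfile("Assisted", "Dominator (Editing)", "Assistant (Auxiliary)"),
    AutonomyLevel.L2: LevelProfile("Partial Autonomy", "Dominator (Orchestrating)", "Executor (Procedural)"),
    AutonomyLevel.L3: LevelProfile("Conditional Autonomy", "Supervisor (Overseeing)", "Dominator (Autonomous)"),
    AutonomyLevel.L4: LevelProfile("High Autonomy", "Onlooker (Delegating)", "Dominator (Proactive)"),
    AutonomyLevel.L5: LevelProfile("Full Autonomy", None, "Dominator (Generative)"),
}


def dominant_party(level: AutonomyLevel) -> str:
    """Return 'human' or 'agent', depending on who holds the Dominator role."""
    profile = TAXONOMY[level]
    if profile.human_role is not None and profile.human_role.startswith("Dominator"):
        return "human"
    return "agent"
```

Under this encoding, the handover of the Dominator role falls between L2 and L3, matching the shift of responsibility described above.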

The transition between these levels represents more than incremental progress; each step up the hierarchy requires a significant evolutionary leap, as shown below. These leaps involve fundamental shifts in a data agent's capabilities, such as gaining environmental perception (L1→L2), achieving autonomous orchestration and taking the dominant role in the task (L2→L3), attaining proactive self-governance with supervision removed (L3→L4), or pioneering new paradigms (L4→L5).

[Figure: Leaps between autonomy levels]

Paper List

We index papers by autonomy level, then by data-related task across Data Management, Data Preparation, and Data Analysis. Most existing work clusters in L1–L3; L4–L5 remain aspirational. We also list relevant surveys, tutorials, and benchmarks.

L0-L1: From Manual Labor to Preliminary Assistance

At the L0 level, data-related tasks are performed entirely by human experts without any automation. The process is completely human-driven and requires extensive domain knowledge and solid technical expertise, which makes it highly specialized and time-consuming.

[Figure: L1]

At the L1 level, data agents begin to provide preliminary, single-point assistance through typical question-answering interactions. While they can help with atomic tasks such as generating code snippets, they lack environmental perception and require considerable human validation, editing, and optimization.

Data Management

Data Preparation

Data Cleaning
Data Integration

Data Analysis

TableQA (Structured)
NL2SQL (Structured)
NL2VIS (Structured)
Unstructured
Data Storytelling

L2: Perceive the Environment

[Figure: L2]

At this level, data agents gain environmental perception and tool-invocation capabilities, enabling them to execute bounded sub-tasks and multi-step procedures. While they can follow human-orchestrated workflows, the overall process is still dominated by human direction.
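
The L2 division of labor described above can be sketched roughly as follows: the human writes the plan, and the agent only perceives the data and invokes tools for each bounded step. Everything here is a hypothetical stand-in, including the `TOOLS` registry, `perceive`, and `run_workflow`; no LLM is involved, and real L2 systems differ considerably.

```python
from typing import Callable, Dict, List, Tuple

Table = List[dict]


def drop_nulls(rows: Table) -> Table:
    """Keep only rows in which every value is present."""
    return [r for r in rows if all(v is not None for v in r.values())]


def dedupe(rows: Table) -> Table:
    """Remove exact duplicate rows while preserving first-seen order."""
    unique = dict.fromkeys(tuple(sorted(r.items())) for r in rows)
    return [dict(items) for items in unique]


# Hypothetical tool registry the agent is allowed to invoke.
TOOLS: Dict[str, Callable[[Table], Table]] = {
    "drop_nulls": drop_nulls,
    "dedupe": dedupe,
}


def perceive(rows: Table) -> dict:
    """Environmental perception: inspect the actual data, not just the prompt."""
    return {"n_rows": len(rows), "columns": sorted({k for r in rows for k in r})}


def run_workflow(rows: Table, plan: List[str]) -> Tuple[Table, List[Tuple[str, int]]]:
    """Execute a human-authored plan; the agent runs each bounded sub-task."""
    log = []
    for step in plan:
        if step not in TOOLS:
            raise ValueError(f"unknown tool: {step}")
        rows = TOOLS[step](rows)
        log.append((step, perceive(rows)["n_rows"]))
    return rows, log


rows = [
    {"id": 1, "city": "HK"},
    {"id": 1, "city": "HK"},
    {"id": 2, "city": None},
]
cleaned, log = run_workflow(rows, ["drop_nulls", "dedupe"])  # plan comes from the human
```

The point of the sketch is that `plan` is an input: the agent executes and observes, but it does not decide what to do next.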

Data Management

Data Preparation

Data Cleaning
Data Integration

Data Analysis

TableQA (Structured)
NL2SQL (Structured)
NL2VIS (Structured)
Data Storytelling
Unstructured

Data System

Proto-L3: Striving for Autonomous Data Agents

[Figure: L3]

Level L3 marks a critical transition where data agents evolve from procedural executors into autonomous directors of data-related tasks. At this stage, they can independently decompose high-level goals and orchestrate and optimize tailored, end-to-end pipelines, shifting the human to a supervisory role. While recent pioneering efforts show promise, they are largely considered very early-stage "Proto-L3" systems. Consequently, the pursuit of more autonomous, reliable, versatile, and comprehensive L3 data agents remains a key objective in both academia and industry.
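
To make the contrast with L2 concrete, the sketch below has the agent derive its own pipeline from the observed data, while the human only approves or rejects it. It is a toy illustration under stated assumptions: `plan_pipeline`, `run_supervised`, and the two rule-based checks are hypothetical, and a real Proto-L3 system would plan with an LLM rather than fixed rules.

```python
from typing import Callable, List, Tuple

Table = List[dict]


def has_nulls(rows: Table) -> bool:
    return any(v is None for r in rows for v in r.values())


def has_duplicates(rows: Table) -> bool:
    keys = [tuple(sorted(r.items())) for r in rows]
    return len(keys) != len(set(keys))


def drop_nulls(rows: Table) -> Table:
    return [r for r in rows if all(v is not None for v in r.values())]


def dedupe(rows: Table) -> Table:
    unique = dict.fromkeys(tuple(sorted(r.items())) for r in rows)
    return [dict(items) for items in unique]


STEPS = {"drop_nulls": drop_nulls, "dedupe": dedupe}


def plan_pipeline(rows: Table) -> List[str]:
    """Agent-side planning: choose steps from the data, not from a human plan."""
    plan = []
    if has_nulls(rows):
        plan.append("drop_nulls")
    if has_duplicates(rows):
        plan.append("dedupe")
    return plan


def run_supervised(
    rows: Table, approve: Callable[[List[str]], bool]
) -> Tuple[Table, List[str]]:
    """Run the agent's own plan only if the human supervisor approves it."""
    plan = plan_pipeline(rows)
    if not approve(plan):
        return rows, []
    for step in plan:
        rows = STEPS[step](rows)
    return rows, plan


rows = [
    {"id": 1, "city": "HK"},
    {"id": 1, "city": "HK"},
    {"id": 2, "city": None},
]
cleaned, executed = run_supervised(rows, approve=lambda plan: True)
```

Here the human's role shrinks to the `approve` callback, which mirrors the supervisory role described above.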

Data Management

Data Analysis

Unstructured

Data System

L4-L5: Vision towards Proactive and Generative Data Agents

L4: Vision of Proactive Data Agents

Data agents at L4 can achieve sustained self-governance with proactive monitoring and optimization across the data lifecycle. They can operate autonomously for extended periods without human supervision, actively providing insights and feedback while maintaining reliability.

[Figure: L4]

L5: The Ultimate Vision of Ubiquitous and Generative Data Agents

The ultimate vision is of fully autonomous data agents that function as expert data scientists, capable of knowledge creation and paradigm innovation for data-related tasks.

Survey, Tutorial and Benchmark

Survey

Tutorial

Benchmark
