Awesome Data Agents

Curated papers and resources on Data Agents, organized by proposed levels of autonomy. Companion list for our (forthcoming) survey.

If you find this repository useful or inspiring, please consider giving us a star ⭐️ and citing our survey:

Coming soon

Introduction

The way humans interact with data is undergoing a profound transformation. Data agents — LLM-powered systems designed to orchestrate the Data + AI ecosystem — are emerging as a promising solution for automating and democratizing data-related tasks across its lifecycle, from management and preparation to analysis.

However, the term "data agent" is currently used inconsistently across research and industry, resulting in considerable ambiguity. Systems that differ vastly in autonomy, reliability, and the complexity of tasks they can handle are often given the same label. This creates a "Tower of Babel" crisis in which mismatched expectations and unclear accountability threaten to undermine user trust and impede the healthy development of the field.

This repository — a companion to our survey — introduces a layered taxonomy (L0-L5) for data agents based on their degree of autonomy, providing a common framework to clarify capability boundaries and lines of accountability at each level.

Levels of Data Agents

As mentioned above, to bring clarity to the diverse landscape of data agents, we propose a layered taxonomy based on their degree of autonomy. This framework maps the progressive shift of responsibility from human to agent, defining the distinct roles each plays at every stage, as summarized in the overview figure and the table below.

Level	Degree of Autonomy	Human Role	Data Agent Role
L0	Manual/No Autonomy	Dominator (Solo)	N/A (None)
L1	Assisted	Dominator (Editing)	Assistant (Auxiliary)
L2	Partial Autonomy	Dominator (Orchestrating)	Executor (Procedural)
L3	Conditional Autonomy	Supervisor (Overseeing)	Dominator (Autonomous)
L4	High Autonomy	Onlooker (Delegating)	Dominator (Proactive)
L5	Full Autonomy	N/A (None)	Dominator (Generative)
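For readers who want to tag systems programmatically, the table above can be written down as a small lookup structure. This is only a minimal sketch: the names `Level`, `Roles`, `TAXONOMY`, and `dominant_party` are illustrative assumptions and not part of the survey; the strings are copied from the table.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Autonomy levels L0-L5 (hypothetical encoding of the table above)."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4
    L5 = 5


@dataclass(frozen=True)
class Roles:
    autonomy: str  # degree of autonomy
    human: str     # human role
    agent: str     # data agent role


# Values copied verbatim from the taxonomy table.
TAXONOMY = {
    Level.L0: Roles("Manual/No Autonomy", "Dominator (Solo)", "N/A (None)"),
    Level.L1: Roles("Assisted", "Dominator (Editing)", "Assistant (Auxiliary)"),
    Level.L2: Roles("Partial Autonomy", "Dominator (Orchestrating)", "Executor (Procedural)"),
    Level.L3: Roles("Conditional Autonomy", "Supervisor (Overseeing)", "Dominator (Autonomous)"),
    Level.L4: Roles("High Autonomy", "Onlooker (Delegating)", "Dominator (Proactive)"),
    Level.L5: Roles("Full Autonomy", "N/A (None)", "Dominator (Generative)"),
}


def dominant_party(level: Level) -> str:
    """Return 'human' or 'agent', depending on who holds the Dominator role."""
    return "human" if TAXONOMY[level].human.startswith("Dominator") else "agent"
```

With this encoding, `dominant_party(Level.L2)` returns `"human"` and `dominant_party(Level.L3)` returns `"agent"`, which mirrors the handover of responsibility between L2 and L3 in the table.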

The transition between these levels is more than incremental progress; each step up the hierarchy requires a significant evolutionary leap, as shown below. These leaps involve fundamental shifts in a data agent's capabilities, such as gaining environmental perception (L1→L2), autonomously orchestrating and taking charge of the task (L2→L3), attaining proactive self-governance once supervision is removed (L3→L4), or pioneering new paradigms (L4→L5).

Paper List

We index papers by autonomy level, then by data-related task across Data Management, Data Preparation, and Data Analysis. Most existing work clusters in L1–L3; L4–L5 remain aspirational. We also list relevant surveys, tutorials, and benchmarks.
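The two-level indexing described above (autonomy level first, then task area) can be pictured as a nested mapping. The sketch below is an assumption about how one might load the list into code, not a tool shipped with this repository; `ENTRIES`, `build_index`, and `INDEX` are hypothetical names, and the four sample entries are copied from the sections that follow.

```python
from collections import defaultdict

# (level, task area, title, venue) tuples copied from entries in this list.
ENTRIES = [
    ("L0-L1", "Data Management", "Query Rewriting via LLMs", "arXiv"),
    ("L2", "Data Management", "GPTuner: A Manual-Reading Database Tuning System", "VLDB'24"),
    ("L2", "NL2SQL", "MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL", "COLING'25"),
    ("Proto-L3", "Data Management", "GaussMaster: An LLM-based Database Copilot System", "arXiv"),
]


def build_index(entries):
    """Group entries by autonomy level, then by task area."""
    index = defaultdict(lambda: defaultdict(list))
    for level, task, title, venue in entries:
        index[level][task].append((title, venue))
    return index


INDEX = build_index(ENTRIES)
```

Looking up `INDEX["L2"]["NL2SQL"]` then yields the L2 text-to-SQL entries, in the same order in which they appear in the list.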

L0-L1: From Manual Labor to Preliminary Assistance

At the L0 level, data-related tasks are performed entirely by human experts without any automation. The process is completely human-driven and requires extensive domain knowledge and solid technical expertise, which makes it highly specialized and time-consuming.

At the L1 level, data agents begin to provide preliminary, single-point assistance through typical question-answering interactions. While they can help with atomic tasks such as generating code snippets, they lack environmental perception and require considerable human validation, editing, and optimization.

Data Management

Query Rewriting via LLMs — arXiv
Query Performance Explanation through LLM for HTAP Systems — arXiv
Demonstration of DB-GPT: Next Generation Data Interaction System Empowered by Large Language Models — VLDB'24 Demo

Data Preparation

Data Cleaning

LLMClean: Context-Aware Tabular Data Cleaning via LLM-Generated OFDs — arXiv
HAIPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation — SIGMOD'23

Data Integration

BoostER: Leveraging Large Language Models for Enhancing Entity Resolution — WWW'24
Leveraging Large Language Models for Entity Matching — arXiv
Entity Matching using Large Language Models — EDBT'25
Using ChatGPT for Entity Matching — ADBIS 2023
Cost-Effective In-Context Learning for Entity Resolution — ICDE'24
Schema Matching with Large Language Models: An Experimental Study — VLDB'24
Magneto: Scaling Schema Matching by Combining Small and Large Language Models — arXiv
Large Language Models for JSON Schema Discovery — arXiv
Matchmaker: Self-Improving LLM Programs for Schema Matching — arXiv
LLMs4SchemaDiscovery: A Human-in-the-Loop Workflow for Scientific Schema Discovery using LLMs — arXiv

Data Analysis

L2: Perceive the Environment

At this level, data agents gain environmental perception and tool-invocation capabilities, enabling them to execute bounded sub-tasks and multi-step procedures. While they can follow human-orchestrated workflows, the overall process is still dominated by human direction.

Data Management

ROMAS: A Role-Based Multi-Agent System for Database monitoring and Planning — arXiv
GPTuner: A Manual-Reading Database Tuning System — VLDB'24
D-Bot: Database Diagnosis System using Large Language Models — VLDB'24
Automatic Database Configuration Debugging using Retrieval-Augmented Language Models — SIGMOD'25
R-Bot: An LLM-based Query Rewrite System — VLDB'25
QUITE: A Query Rewrite System Beyond Rules with LLM Agents — arXiv
Revolutionizing Database Q&A with Large Language Models: Comprehensive Benchmark and Evaluation — KDD'25

Data Preparation

Data Cleaning

ChatPipe: Orchestrating Data Preparation Pipelines by Optimizing Human–ChatGPT Interactions — SIGMOD'24 Demo
CleanAgent: Automating Data Standardization with LLM-based Agents — arXiv
AutoDCWorkflow: LLM-based Data Cleaning Workflow Auto-Generation — arXiv
Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines — arXiv
Weak-to-Strong Prompts with Lightweight-to-Powerful LLMs for High-Accuracy, Low-Cost, and Explainable Data Transformation — VLDB'24
Exploring LLM Agents for Cleaning Tabular Machine Learning Datasets — ICLR'25 Workshop on Foundation Models in the Wild
AutoPrep: Natural Language Question-Aware Data Preparation with a Multi-Agent Framework — VLDB'25

Data Integration

Match, Compare, or Select? An Investigation of Large Language Models for Entity Matching — COLING'25
Agent-OM: Leveraging LLM Agents for Ontology Matching — VLDB'25
Knowledge Graph-based Retrieval-Augmented Generation for Schema Matching — arXiv

Data Analysis

TableQA (Structured)

Binder: Binding Language Models in Symbolic Languages — ICLR'23
StructGPT: A General Framework for Large Language Model on Structured Data — EMNLP'23
ReAcTable: Enhancing ReAct for Table Question Answering — VLDB'24
Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning — ACL'25
Large Language Models are Versatile Decomposers: Decomposing Evidence and Questions for Table-based Reasoning — SIGIR'23
Chain-of-Table: Evolving Tables in the Reasoning Chain for Table Understanding — ICLR'24
AutoTQA: Towards Autonomous Tabular Question Answering through Multi-Agent Large Language Models — VLDB'24
TableMaster: A Recipe to Advance Table Understanding with Language Models — arXiv

NL2SQL (Structured)

CHASE-SQL: Multi-Path Reasoning and Preference Optimization for Text-to-SQL — ICLR'25
MAC-SQL: A Multi-Agent Collaborative Framework for Text-to-SQL — COLING'25
Alpha-SQL: Zero-Shot Text-to-SQL using Monte Carlo Tree Search — arXiv
ReFoRCE: A Text-to-SQL Agent with Self-Refinement, Consensus Enforcement, and Column Exploration — arXiv
ChatBI: Towards Natural Language to Complex Business Intelligence SQL — arXiv

NL2VIS (Structured)

MatPlotAgent: Method and Evaluation for LLM-Based Agentic Scientific Data Visualization — arXiv
DeepVIS: Bridging Natural Language and Data Visualization Through Step-wise Reasoning — VIS'25
C2: Scalable Auto-Feedback for LLM-based Chart Generation — NAACL'25
PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback — WWW'25
nvAgent: Automated Data Visualization from Natural Language via Collaborative Agent Workflow — ACL'25

Data Storytelling

Unstructured

QUEST: Query Optimization in Unstructured Document Analysis — VLDB'25

Data System

Semantic Operators: A Declarative Model for Rich, AI-Augmented Data Analytics — arXiv
LAMBDA: A Large Model Based Data Agent — arXiv
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks — ICML'24

Proto-L3: Striving for Autonomous Data Agents

Level L3 marks a critical transition where data agents evolve from procedural executors into autonomous directors of data-related tasks. At this stage, they can independently decompose high-level goals and then orchestrate and optimize tailored end-to-end pipelines, shifting the human to a supervisory role. While recent pioneering efforts show promise, they are largely considered very early-stage "Proto-L3" systems. Consequently, the pursuit of more autonomous, reliable, versatile, and comprehensive L3 data agents remains a key objective in both academia and industry.

Data Management

DBAIOps: A Reasoning LLM-Enhanced Database Operation and Maintenance System using Knowledge Graphs — arXiv
GaussMaster: An LLM-based Database Copilot System — arXiv

Data Analysis

Unstructured

Unify: A System For Unstructured Data Analytics — ICDE'25
DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason-Verify — arXiv

Data System

Chat2Data: An Interactive Data Analysis System with RAG, Vector Databases and LLMs — VLDB'24 Demo
Data Interpreter: An LLM Agent For Data Science — ACL'25 Findings
AOP: Automated and Interactive LLM Pipeline Orchestration for Answering Complex Queries — CIDR'25
Palimpzest: Optimizing AI-Powered Analytics with Declarative Query Processing — CIDR'25
AgenticData: An Agentic Data Analytics System for Heterogeneous Data — arXiv
SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science — arXiv
Data Agent: A Holistic Architecture for Orchestrating Data+AI Ecosystems — arXiv

L4-L5: Vision towards Proactive and Generative Data Agents

L4: Vision of Proactive Data Agents

Data agents at L4 can achieve sustained self-governance with proactive monitoring and optimization across the data lifecycle. They can operate autonomously for extended periods without human supervision, actively providing insights and feedback while maintaining reliability.

L5: The Ultimate Vision of Ubiquitous and Generative Data Agents

L5 represents the ultimate vision: fully autonomous data agents that function as expert data scientists, capable of knowledge creation and paradigm innovation for data-related tasks.

Survey, Tutorial and Benchmark

Survey

Natural Language Interfaces for Tabular Data Querying and Visualization: A Survey — TKDE'24
A Survey of Text-to-SQL in the Era of LLMs: Where are we, and where are we going? — TKDE'25
A Survey of LLM x DATA — arXiv
Large Language Models for Data Science: A Survey — arXiv
A Survey on Large Language Model-based Agents for Statistics and Data Science — arXiv
Large Language Model-based Data Science Agent: A Survey — arXiv
Large Language Models for Data Discovery and Integration: Challenges and Opportunities — IEEE Data Eng. Bull. 2025
Large Language Models for Data Annotation and Synthesis: A Survey — EMNLP'24
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems — arXiv

Tutorial

Data+AI: LLM4Data and Data4LLM — SIGMOD'25 Tutorial
LLM for Data Management — VLDB'24 Tutorial

Benchmark

DABstep: Data Agent Benchmark for Multi-step Reasoning — arXiv
FDABench: A Benchmark for Data Agents on Analytical Queries over Heterogeneous Data — arXiv
