⚡ AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning
This is the official repository for "AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning".
- 2026-01: AudioGenie-Reasoner has been accepted by ICASSP 2026. The code has been released!
- 2025-09: AudioGenie-Reasoner is released on arXiv.
Audio deep reasoning is a challenging task that requires expert-level perception, multi-step logical inference, and the integration of contextual knowledge. However, existing models suffer from a gap between audio perception and reasoning abilities, due to the lack of training data with explicit reasoning chains and the absence of mechanisms for active exploration and iterative refinement. To address these challenges, we propose AudioGenie-Reasoner (AGR), the first unified training-free multi-agent system that coordinates perception and reasoning over an evolving chain of textual evidence. Our key idea is a paradigm shift that reframes audio deep reasoning as a complex text understanding task, thereby unlocking the full potential of large language models. Specifically, the design of AGR mimics the human coarse-to-fine cognitive process. It first transforms the input audio into a coarse text-based document. Then, a novel proactive iterative document refinement loop, featuring tool-augmented routes and specialized agents, continuously searches for missing information and augments the evidence chain in a coarse-to-fine manner until sufficient question-related information has been gathered to make a final prediction. Experimental results show that AGR achieves state-of-the-art (SOTA) performance among open-source audio deep reasoning models across various benchmarks.
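To make the coarse-to-fine loop concrete, here is a minimal sketch of the control flow described above. Every function name below is a hypothetical placeholder standing in for the framework's perception, refinement, and reasoning agents; it is not the repository's actual API.

```python
# Illustrative sketch of AGR's coarse-to-fine evidence loop.
# All function bodies are stand-ins: the real agents live in
# scripts/run_reasoner.py and call external audio/LLM models.

def coarse_transcribe(audio_path: str) -> str:
    """Stand-in for the perception step that turns audio into a coarse text document."""
    return f"coarse description of {audio_path}"

def find_missing_info(document: str, question: str) -> list:
    """Stand-in for the agent that checks whether the evidence chain answers the question."""
    return []  # an empty list means the evidence is already sufficient

def refine_with_tools(document: str, gaps: list) -> str:
    """Stand-in for the tool-augmented routes that fetch finer-grained evidence."""
    return document + "\n" + "\n".join(f"new evidence for: {g}" for g in gaps)

def answer(document: str, question: str) -> str:
    """Stand-in for the LLM that reasons over the textual evidence chain."""
    return f"answer to '{question}' based on the collected evidence"

def agr_pipeline(audio_path: str, question: str, max_iters: int = 3) -> str:
    document = coarse_transcribe(audio_path)           # coarse pass over the audio
    for _ in range(max_iters):                         # proactive iterative refinement
        gaps = find_missing_info(document, question)
        if not gaps:                                   # evidence chain is sufficient
            break
        document = refine_with_tools(document, gaps)   # augment the evidence chain
    return answer(document, question)

if __name__ == "__main__":
    print(agr_pipeline("example.wav", "What happens after the dog barks?"))
```

The real system fills each identified gap via its tool-augmented routes and specialized agents; the sketch only illustrates the loop structure.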
Performance comparison of AudioGenie-Reasoner with other audio reasoning models.
Overview of the AudioGenie Framework.
- Set your API key for GPT-4o in `scripts/run_reasoner.py` (or export it as an environment variable; a hedged sketch of that option is given below the setup steps):

```python
# scripts/run_reasoner.py, line 69
YOUR_API_KEY = ""  # input your API key here
```

- Run the inference script:

```bash
python scripts/run_reasoner.py \
--base_dir LLM/benchmark/MMAR \
--input_json MMAR-meta-new.json \
--audio_dir data/audio \
--output_dir LLM/dasheng-lm/result \
--output_json dasheng-MMAR-multiagent.json \
--model_id mispeech/midashenglm-7b \
--deepseek_model gpt-4o-2024-08-06 \
    --max_iters 3
```

If you have any comments or questions, feel free to contact me (yrong854@connect.hkust-gz.edu.cn).
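For the environment-variable option mentioned in the setup above, a minimal sketch is given here; the variable name `OPENAI_API_KEY` is an assumption, not a documented interface of the script.

```python
import os

# Assumption: read the GPT-4o key from an environment variable instead of
# hard-coding it at line 69 of scripts/run_reasoner.py. The variable name
# OPENAI_API_KEY is hypothetical; adjust it to whatever your setup uses.
YOUR_API_KEY = os.environ.get("OPENAI_API_KEY", "")
```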
If you find our work useful, please consider citing:
```bibtex
@article{rong2025audiogenie,
  title={AudioGenie-Reasoner: A Training-Free Multi-Agent Framework for Coarse-to-Fine Audio Deep Reasoning},
  author={Rong, Yan and Li, Chenxing and Yu, Dong and Liu, Li},
  journal={arXiv preprint arXiv:2509.16971},
  year={2025}
}
```

