🎶 AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation
This is the official repository for "AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation".
- 2025-10: MA-Bench has been released!
- 2025-07: AudioGenie has been accepted by ACM MM 2025! We look forward to seeing you in Dublin, Ireland!
Multimodality-to-Multiaudio (MM2MA) generation faces significant challenges in synthesizing diverse and contextually aligned audio types (e.g., sound effects, speech, music, and songs) from multimodal inputs (e.g., video, text, images), owing to the scarcity of high-quality paired datasets and the lack of robust multi-task learning frameworks. Recently, multi-agent systems have shown great potential in tackling these issues. However, directly applying them to the MM2MA task presents three critical challenges: (1) inadequate fine-grained understanding of multimodal inputs (especially video), (2) the inability of single models to handle diverse audio events, and (3) the absence of self-correction mechanisms for reliable outputs. To this end, we propose AudioGenie, a novel training-free multi-agent system featuring a dual-layer architecture with a generation team and a supervisor team. For the generation team, a fine-grained task decomposition and an adaptive Mixture-of-Experts (MoE) collaborative entity are designed for detailed, comprehensive multimodal understanding and dynamic model selection, and a trial-and-error iterative refinement module enables self-correction. The supervisor team ensures temporal-spatial consistency and verifies outputs through feedback loops. Moreover, we build MA-Bench, the first benchmark for MM2MA tasks, comprising 198 annotated videos with multi-type audios. Experiments demonstrate that AudioGenie achieves state-of-the-art (SOTA) or comparable performance across 9 metrics in 8 tasks. A user study further validates the effectiveness of our method in terms of quality, accuracy, alignment, and aesthetics.
Overview of the AudioGenie Framework.
The dataset has been released on Hugging Face.
Statistics of video categories within our MA-Bench.
- Create an Anaconda environment:

  ```bash
  git clone https://github.com/ryysayhi/AudioGenie.git
  cd AudioGenie
  conda create -n AudioGenie python=3.10
  conda activate AudioGenie
  pip install -r requirements.txt
  ```

- Install ffmpeg:

  ```bash
  sudo apt-get install ffmpeg
  ```
- In the `/bin` folder, we provide four examples: MMAudio, CosyVoice, InspireMusic, and DiffRhythm. You can clone each project and install it following its own guide. Then set:

  ```bash
  export MMAUDIO_HOME=<PATH_TO_MMAUDIO>
  export COSYVOICE_HOME=<PATH_TO_COSYVOICE>
  export INSPIREMUSIC_HOME=<PATH_TO_INSPIREMUSIC>
  export DIFFRHYTHM_HOME=<PATH_TO_DIFFRHYTHM>
  export MMAUDIO_CONDA=mmaudio
  export COSYVOICE_CONDA=cosyvoice
  export INSPIREMUSIC_CONDA=inspiremusic
  export DIFFRHYTHM_CONDA=diffrhythm
  ```
- To extend the library, add your preferred speech / song / music / sound-effect models by defining a `ToolSpec` in `tools.py` and adding a matching `run_model.py` in `/bin` (see the sketch after this list).
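As a rough illustration of this extension path, here is a minimal sketch; the `ToolSpec` fields, the `my_sfx_model` entry, the `MY_SFX_HOME` variable, and the model's `infer.py` entry point are all hypothetical placeholders, so check `tools.py` and the four provided wrappers in `/bin` for the actual conventions.

```python
# tools.py -- hypothetical sketch of a ToolSpec entry (field names are
# illustrative placeholders, not the repository's real interface)
from dataclasses import dataclass

@dataclass
class ToolSpec:
    name: str          # identifier the agents use to select this expert
    audio_type: str    # "speech", "song", "music", or "sound_effect"
    script: str        # wrapper script under /bin that runs the model
    conda_env: str     # conda environment the wrapper should be launched in

# Registering a new sound-effect model (placeholder values)
MY_SFX_MODEL = ToolSpec(
    name="my_sfx_model",
    audio_type="sound_effect",
    script="bin/run_my_sfx_model.py",
    conda_env="my_sfx_env",
)
```

A matching wrapper in `/bin` could then simply forward the request to the model's own inference entry point, locating the cloned repo through an environment variable in the same way as the `*_HOME` exports above:

```python
# bin/run_my_sfx_model.py -- hypothetical wrapper stub
import argparse
import os
import subprocess

def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", required=True)   # text description of the target audio
    parser.add_argument("--outdir", required=True)   # directory for the generated audio
    args = parser.parse_args()

    # Locate the cloned model repo via an env var, mirroring MMAUDIO_HOME etc.
    model_home = os.environ["MY_SFX_HOME"]
    # Call the model's own inference script (infer.py is a placeholder name).
    subprocess.run(
        ["python", "infer.py", "--prompt", args.prompt, "--output", args.outdir],
        cwd=model_home,
        check=True,
    )

if __name__ == "__main__":
    main()
```

Following the pattern of the provided examples, such a wrapper would be invoked inside the model's own conda environment (hence the matching `*_CONDA` export).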
We use Gemini as the MLLM in this repo. You can swap it for another MLLM (e.g., Qwen2.5-VL, which we used in the paper).
- Set your API key for Gemini in `run.py` (or export it as an environment variable):

  ```python
  os.environ['GEMINI_API_KEY'] = 'Your_Gemini_Api_Key'
  # or in shell:
  # export GEMINI_API_KEY=Your_Gemini_Api_Key
  ```
- Run the inference script:

  ```bash
  python AudioGenie/run.py \
      --video <PATH_TO_VIDEO or omit> \
      --image <PATH_TO_IMAGE or omit> \
      --text "<YOUR_TEXT or omit>" \
      --outdir <OUTPUT_DIR>
  ```
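Any subset of the modality flags can be supplied. For instance, to generate audio for a video guided by a text prompt (the file and directory names below are placeholders):

```bash
python AudioGenie/run.py \
    --video demo/beach_sunset.mp4 \
    --text "calm waves with gentle background music" \
    --outdir outputs/beach_sunset
```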
If you have any comments or questions, feel free to contact me (yrong854@connect.hkust-gz.edu.cn).
If you find our work useful, please consider citing:
@article{rong2025audiogenie,
title={AudioGenie: A Training-Free Multi-Agent Framework for Diverse Multimodality-to-Multiaudio Generation},
author={Rong, Yan and Wang, Jinting and Lei, Guangzhi and Yang, Shan and Liu, Li},
journal={arXiv preprint arXiv:2505.22053},
year={2025}
}

