We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing a deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 920 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-Pro and GPT-4o, achieving 8.2%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design.
The demo video can also be accessed on Bilibili.
We introduce M3-Bench, a long-video question answering dataset designed to evaluate the capability of multimodal agents to perform reasoning over long-term memory. Each instance in M3-Bench comprises a long video simulating the perceptual input of an agent, along with a series of open-ended question-answer pairs. The dataset is organized into two subsets:
- M3-Bench-robot, which contains 100 real-world videos recorded from a robot's first-person perspective,
- M3-Bench-web, which includes 920 web-sourced videos covering a wider variety of content and scenarios.
Examples from M3-Bench. M3-Bench-robot features long videos from realistic robotic work scenarios, while M3-Bench-web expands the video diversity to support broader evaluation. The question-answering tasks are designed to assess a multimodal agent’s ability to construct consistent and reliable long-term memory, as well as to reason effectively over that memory.
Statistical overview of M3-Bench benchmark. Each question may correspond to multiple question types.
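Each annotation pairs a video with open-ended question-answer pairs tagged by question type. As a purely illustrative sketch (all field names except video_url are assumptions, not the released schema), one entry in data/annotations/web.json might look like:
{"id": "web_0001", "video_url": "https://...", "qa_pairs": [{"question": "...", "answer": "...", "question_type": ["human understanding", "cross-modal reasoning"]}]}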
- Download M3-Bench-robot from huggingface
- Download M3-Bench-web from the video_url field in data/annotations/web.json
[optional] You can either download the intermediate outputs we have processed from huggingface or generate them directly from the videos by following the steps below.
[optional] You can either download and extract the memory graphs we have processed from huggingface or generate them directly from the videos by following the steps below.
Architecture of M3-Agent. The system consists of two parallel processes: memorization and control. During memorization, M3-Agent processes video and audio streams online to generate episodic and semantic memory. During control, it executes instructions by iteratively thinking and retrieving from long-term memory. The long-term memory is structured as a multimodal graph.
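The long-term memory graph is entity-centric and multimodal: entities (people, objects, places) are nodes that accumulate semantic attributes, while clip-level episodic records link back to them. The memory graphs shipped with M3-Bench are pickled Python objects (see data/memory_graphs below); the sketch here is only a conceptual illustration of such a structure, and every name in it (MemoryNode, MemoryGraph, add_episodic, ...) is an assumption rather than the repo's actual classes.
from dataclasses import dataclass, field

# Conceptual sketch of an entity-centric multimodal memory graph.
# NOTE: these classes are illustrative assumptions, not the actual
# structures stored in data/memory_graphs/*.pkl.

@dataclass
class MemoryNode:
    node_id: str                 # e.g. a person, object, or place
    modality: str                # "face", "voice", "text", ...
    embedding: list = field(default_factory=list)   # feature used for cross-modal matching
    attributes: dict = field(default_factory=dict)  # semantic memory, e.g. {"name": "Alice"}

@dataclass
class MemoryGraph:
    nodes: dict[str, MemoryNode] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)
    episodic: list[dict] = field(default_factory=list)               # clip-level event records

    def add_episodic(self, clip_id: int, text: str, entity_ids: list[str]) -> None:
        """Record what happened in a clip and link it to the entities involved."""
        self.episodic.append({"clip_id": clip_id, "text": text, "entities": entity_ids})

    def add_semantic(self, node_id: str, key: str, value: str) -> None:
        """Attach accumulated world knowledge to an entity node."""
        self.nodes[node_id].attributes[key] = value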
Results on M3-Bench-robot, M3-Bench-web, and VideoMME-long.
Before running, add the API config in configs/api_config.json.
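The memorization and evaluation scripts read API credentials from this file. Its exact schema is defined by the repo; purely as a hedged placeholder (the field names below are assumptions), an OpenAI-style entry might look like:
{"openai": {"api_key": "sk-...", "base_url": "https://api.openai.com/v1"}}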
Generate memory graphs for each video. The results are saved in data/memory_graphs.
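The saved memory graphs are standard pickle files, so you can load one programmatically to inspect it (visualization.py below gives a richer view). A minimal sketch, assuming only that the .pkl deserializes to a Python object:
import pickle

# Load a generated memory graph and take a quick look at what it contains.
# The object's exact type and attributes are defined by M3-Agent's code, so
# run this from the repo root so those classes are importable for unpickling.
with open("data/memory_graphs/robot/bedroom_01.pkl", "rb") as f:
    memory_graph = pickle.load(f)

print(type(memory_graph))
print([a for a in dir(memory_graph) if not a.startswith("_")])  # available fields/methods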
- The following steps are required only if you haven't downloaded intermediate_outputs and memory_graphs from huggingface, or if you want to process videos that are not from M3-Bench.
- Set up environment
bash setup.sh
pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
pip install qwen-omni-utils==0.0.4
- Cut Video
Cut the video into 30-second segments.
#!/bin/bash
video="robot/bedroom_01"
input="data/videos/$video.mp4"
mkdir -p "data/clips/$video"

# Round the duration up to a whole second, then count 30-second segments (ceiling).
duration=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$input")
duration_seconds=$(echo "$duration" | awk '{print int($1 + 0.999)}')
segments=$(((duration_seconds + 29) / 30))

for ((i=0; i<segments; i++)); do
  start=$((i * 30))
  output="data/clips/$video/$i.mp4"
  # -c copy avoids re-encoding; with stream copy, cut points snap to the nearest keyframe.
  ffmpeg -ss "$start" -i "$input" -t 30 -c copy "$output"
done
- Prepare data
Prepare a jsonl file with one video per line, saved as data/data.jsonl:
{"id": "bedroom_01", "video_path": "data/videos/robot/bedroom_01.mp4", "clip_path": "data/clips/robot/bedroom_01", "mem_path": "data/memory_graphs/robot/bedroom_01.pkl", "intermediate_path": "data/intermediate_outputs/robot/bedroom_01"}
- Generate Intermediate Outputs
This step uses Face Detection and Speaker Diarization tools to generate intermediate outputs.
- If you want to use M3-Bench and have downloaded intermediate_outputs from huggingface, you can skip this step.
- Download the audio embedding model pretrained_eres2netv2.ckpt and save it into models/
- Download speakerlab
m3-agent
├── models
│   └── pretrained_eres2netv2.ckpt
└── speakerlab
python m3_agent/memorization_intermediate_outputs.py \
--data_file data/data.jsonl
- Generate Memory Graphs
This step uses the M3-Agent-Memorization model to generate memory graphs.
- Download M3-Agent-Memorization from huggingface
python m3_agent/memorization_memory_graphs.py \
--data_file data/data.jsonl
- Memory Graph Visualization
python visualization.py \
--mem_path data/memory_graphs/robot/bedroom_01.pkl \
--clip_id 1
- Set up environment
bash setup.sh
pip install transformers==4.51.0
pip install vllm==0.8.4
pip install numpy==1.26.4
- Question Answering and Evaluation
This step uses the M3-Agent-Control model to generate answers and GPT-4o to evaluate them.
- Download M3-Agent-Control from huggingface
python m3_agent/control.py \
--data_file data/annotations/robot.json
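control.py handles both answering and judging; the sketch below only illustrates the GPT-4o-as-judge idea in isolation. The judging prompt here is a simplified stand-in, not the repo's actual evaluation prompt or criteria.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_answer(question: str, reference: str, prediction: str) -> bool:
    """Ask GPT-4o whether a predicted answer matches the reference answer.
    The judging prompt is a simplified stand-in for the repo's own."""
    prompt = (
        "You are grading an open-ended video QA answer.\n"
        f"Question: {question}\nReference answer: {reference}\nModel answer: {prediction}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("correct")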
If you want to prompt other models to generate memory or answer questions, you only need to replace the local model inference with API calls and use the corresponding prompt (see the sketch after the prompt list below).
Prompts:
- Memorization
  - Gemini/GPT-4o: mmagent.prompts.prompt_generate_captions_with_ids
  - Qwen2.5-Omni-7B: mmagent.prompts.prompt_generate_full_memory
- Control
  - GPT-4o: mmagent.prompts.prompt_answer_with_retrieval_final
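For example, to answer questions with GPT-4o instead of M3-Agent-Control, you would fill the corresponding prompt from mmagent.prompts and send it to the API. This is a rough sketch: it assumes the prompt is a format string taking the question and retrieved memory, whereas the actual placeholders and call flow are defined in the repo.
from openai import OpenAI
from mmagent import prompts

client = OpenAI()

def answer_with_gpt4o(question: str, retrieved_memory: str) -> str:
    """Replace local model inference with an API call, reusing the repo's prompt.
    Assumption: the template takes `question` and `memory` placeholders; the exact
    placeholders are defined in mmagent.prompts and may differ."""
    filled = prompts.prompt_answer_with_retrieval_final.format(
        question=question, memory=retrieved_memory
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": filled}],
    )
    return response.choices[0].message.content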
- Memorization: https://github.com/hyc2026/sft-qwen2.5-omni-thinker
- Control: https://github.com/hyc2026/M3-Agent-Training
Please cite us as:
@misc{long2025seeing,
title={Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory},
author={Lin Long and Yichen He and Wentao Ye and Yiyuan Pan and Yuan Lin and Hang Li and Junbo Zhao and Wei Li},
year={2025},
eprint={2508.09736},
archivePrefix={arXiv},
primaryClass={cs.CV}
}