We introduce M3-Agent, a novel multimodal agent framework equipped with long-term memory. Like humans, M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. Beyond episodic memory, it also develops semantic memory, enabling it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal format, allowing a deeper and more consistent understanding of the environment. Given an instruction, M3-Agent autonomously performs multi-turn, iterative reasoning and retrieves relevant information from memory to accomplish the task. To evaluate memory effectiveness and memory-based reasoning in multimodal agents, we develop M3-Bench, a new long-video question answering benchmark. M3-Bench comprises 100 newly recorded real-world videos captured from a robot's perspective (M3-Bench-robot) and 920 web-sourced videos across diverse scenarios (M3-Bench-web). We annotate question-answer pairs designed to test key capabilities essential for agent applications, such as human understanding, general knowledge extraction, and cross-modal reasoning. Experimental results show that M3-Agent, trained via reinforcement learning, outperforms the strongest baseline, a prompting agent using Gemini-1.5-Pro and GPT-4o, achieving 8.2%, 7.7%, and 5.3% higher accuracy on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our work advances multimodal agents toward more human-like long-term memory and provides insights into their practical design.
The demo video can also be accessed on Bilibili.
We introduce M3-Bench, a long-video question answering dataset designed to evaluate the capability of multimodal agents to perform reasoning over long-term memory. Each instance in M3-Bench comprises a long video simulating the perceptual input of an agent, along with a series of open-ended question-answer pairs. The dataset is organized into two subsets:
- M3-Bench-robot, which contains 100 real-world videos recorded from a robot's first-person perspective,
- M3-Bench-web, which includes 920 web-sourced videos covering a wider variety of content and scenarios.
Examples from M3-Bench. M3-Bench-robot features long videos from realistic robotic work scenarios, while M3-Bench-web expands the video diversity to support broader evaluation. The question-answering tasks are designed to assess a multimodal agent’s ability to construct consistent and reliable long-term memory, as well as to reason effectively over that memory.
Statistical overview of M3-Bench benchmark. Each question may correspond to multiple question types.
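Each annotation pairs a video with open-ended question-answer pairs tagged by question type. As a purely illustrative sketch (all field names except video_url are assumptions, not the released schema), one entry in data/annotations/web.json might look like:
{"id": "web_0001", "video_url": "https://...", "qa_pairs": [{"question": "...", "answer": "...", "question_type": ["human understanding", "cross-modal reasoning"]}]}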
- Download M3-Bench-robot from huggingface
- Download M3-Bench-web from the video_url field in data/annotations/web.json
[optional] You can either download the intermediate outputs we have processed from huggingface or generate them directly from the videos by following the steps below.
[optional] You can either download and extract the memory graphs we have processed from huggingface or generate them directly from the videos by following the steps below.
Architecture of M3-Agent. The system consists of two parallel processes: memorization and control. During memorization, M3-Agent processes video and audio streams online to generate episodic and semantic memory. During control, it executes instructions by iteratively thinking and retrieving from long-term memory. The long-term memory is structured as a multimodal graph.
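The long-term memory graph is entity-centric and multimodal: entities (people, objects, places) are nodes that accumulate semantic attributes, while clip-level episodic records link back to them. The memory graphs shipped with M3-Bench are pickled Python objects (see data/memory_graphs below); the sketch here is only a conceptual illustration of such a structure, and every name in it (MemoryNode, MemoryGraph, add_episodic, ...) is an assumption rather than the repo's actual classes.
from dataclasses import dataclass, field

# Conceptual sketch of an entity-centric multimodal memory graph.
# NOTE: these classes are illustrative assumptions, not the actual
# structures stored in data/memory_graphs/*.pkl.

@dataclass
class MemoryNode:
    node_id: str                 # e.g. a person, object, or place
    modality: str                # "face", "voice", "text", ...
    embedding: list = field(default_factory=list)   # feature used for cross-modal matching
    attributes: dict = field(default_factory=dict)  # semantic memory, e.g. {"name": "Alice"}

@dataclass
class MemoryGraph:
    nodes: dict[str, MemoryNode] = field(default_factory=dict)
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (src, relation, dst)
    episodic: list[dict] = field(default_factory=list)               # clip-level event records

    def add_episodic(self, clip_id: int, text: str, entity_ids: list[str]) -> None:
        """Record what happened in a clip and link it to the entities involved."""
        self.episodic.append({"clip_id": clip_id, "text": text, "entities": entity_ids})

    def add_semantic(self, node_id: str, key: str, value: str) -> None:
        """Attach accumulated world knowledge to an entity node."""
        self.nodes[node_id].attributes[key] = value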
Results on M3-Bench-robot, M3-Bench-web, and VideoMME-long.
Before running, add the API config in configs/api_config.json.
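The memorization and evaluation scripts read API credentials from this file. Its exact schema is defined by the repo; purely as a hedged placeholder (the field names below are assumptions), an OpenAI-style entry might look like:
{"openai": {"api_key": "sk-...", "base_url": "https://api.openai.com/v1"}}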
Generate memory graphs for each video. The results are saved in data/memory_graphs.
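The saved memory graphs are standard pickle files, so you can load one programmatically to inspect it (visualization.py below gives a richer view). A minimal sketch, assuming only that the .pkl deserializes to a Python object:
import pickle

# Load a generated memory graph and take a quick look at what it contains.
# The object's exact type and attributes are defined by M3-Agent's code, so
# run this from the repo root so those classes are importable for unpickling.
with open("data/memory_graphs/robot/bedroom_01.pkl", "rb") as f:
    memory_graph = pickle.load(f)

print(type(memory_graph))
print([a for a in dir(memory_graph) if not a.startswith("_")])  # available fields/methods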
- The following steps are required only if you haven't downloaded intermediate_outputs and memory_graphs from huggingface, or if you want to process videos that are not from M3-Bench.
- Set up environment
bash setup.sh
pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
pip install qwen-omni-utils==0.0.4
- Cut Video
Cut the video into 30-second segments.
#!/bin/bash
video="robot/bedroom_01"
input="data/videos/$video.mp4"
mkdir -p "data/clips/$video"

# Round the duration up to a whole second, then count 30-second segments (ceiling).
duration=$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 "$input")
duration_seconds=$(echo "$duration" | awk '{print int($1 + 0.999)}')
segments=$(((duration_seconds + 29) / 30))

for ((i=0; i<segments; i++)); do
  start=$((i * 30))
  output="data/clips/$video/$i.mp4"
  # -c copy avoids re-encoding; with stream copy, cut points snap to the nearest keyframe.
  ffmpeg -ss "$start" -i "$input" -t 30 -c copy "$output"
done
- Prepare data
Prepare a jsonl file with one video per line, saved as data/data.jsonl:
{"id": "bedroom_01", "video_path": "data/videos/robot/bedroom_01.mp4", "clip_path": "data/clips/robot/bedroom_01", "mem_path": "data/memory_graphs/robot/bedroom_01.pkl", "intermediate_path": "data/intermediate_outputs/robot/bedroom_01"}
- Generate Intermediate Outputs
This step uses Face Detection and Speaker Diarization tools to generate intermediate outputs.
- If you want to use M3-Bench and have downloaded intermediate_outputs from huggingface, you can skip this step.
- Download the audio embedding model pretrained_eres2netv2.ckpt and save it into models/
- Download speakerlab
m3-agent
├── models
│   └── pretrained_eres2netv2.ckpt
└── speakerlab
python m3_agent/memorization_intermediate_outputs.py \
--data_file data/data.jsonl
- Generate Memory Graphs
This step uses the M3-Agent-Memorization model to generate memory graphs.
- Download M3-Agent-Memorization from huggingface
python m3_agent/memorization_memory_graphs.py \
--data_file data/data.jsonl
- Memory Graph Visualization
python visualization.py \
--mem_path data/memory_graphs/robot/bedroom_01.pkl \
--clip_id 1
- Set up environment
bash setup.sh
pip install transformers==4.51.0
pip install vllm==0.8.4
pip install numpy==1.26.4
- Question Answering and Evaluation
This step uses the M3-Agent-Control model to generate answers and GPT-4o to evaluate them.
- Download M3-Agent-Control from huggingface
python m3_agent/control.py \
--data_file data/annotations/robot.json
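control.py handles both answering and judging; the sketch below only illustrates the GPT-4o-as-judge idea in isolation. The judging prompt here is a simplified stand-in, not the repo's actual evaluation prompt or criteria.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_answer(question: str, reference: str, prediction: str) -> bool:
    """Ask GPT-4o whether a predicted answer matches the reference answer.
    The judging prompt is a simplified stand-in for the repo's own."""
    prompt = (
        "You are grading an open-ended video QA answer.\n"
        f"Question: {question}\nReference answer: {reference}\nModel answer: {prediction}\n"
        "Reply with exactly 'correct' or 'incorrect'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower().startswith("correct")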
If you want to prompt other models to generate memory or answer questions, you only need to replace the local model inference with API calls and use the corresponding prompt (see the sketch after the prompt list below).
Prompts:
- Memorization
  - Gemini/GPT-4o: mmagent.prompts.prompt_generate_captions_with_ids
  - Qwen2.5-Omni-7B: mmagent.prompts.prompt_generate_full_memory
- Control
  - GPT-4o: mmagent.prompts.prompt_answer_with_retrieval_final
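For example, to answer questions with GPT-4o instead of M3-Agent-Control, you would fill the corresponding prompt from mmagent.prompts and send it to the API. This is a rough sketch: it assumes the prompt is a format string taking the question and retrieved memory, whereas the actual placeholders and call flow are defined in the repo.
from openai import OpenAI
from mmagent import prompts

client = OpenAI()

def answer_with_gpt4o(question: str, retrieved_memory: str) -> str:
    """Replace local model inference with an API call, reusing the repo's prompt.
    Assumption: the template takes `question` and `memory` placeholders; the exact
    placeholders are defined in mmagent.prompts and may differ."""
    filled = prompts.prompt_answer_with_retrieval_final.format(
        question=question, memory=retrieved_memory
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": filled}],
    )
    return response.choices[0].message.content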
- Memorization: https://github.com/hyc2026/sft-qwen2.5-omni-thinker
- Control: https://github.com/hyc2026/M3-Agent-Training
Please cite us as:
@misc{long2025seeing,
title={Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory},
author={Lin Long and Yichen He and Wentao Ye and Yiyuan Pan and Yuan Lin and Hang Li and Junbo Zhao and Wei Li},
year={2025},
eprint={2508.09736},
archivePrefix={arXiv},
primaryClass={cs.CV}
}