This repository contains the official implementation for the paper: EverMemModel.
Large Language Models (LLMs) struggle in knowledge-intensive domains that require deep, specialized knowledge. While Retrieval-Augmented Generation (RAG) is a common solution, its decoupled retrieve-then-read pipeline suffers from misaligned objectives and is prone to performance degradation from distractor documents. We propose EverMemModel, a unified, end-to-end trainable memory model. EverMemModel can handle memory contexts on the scale of 100 million tokens. Our model achieves state-of-the-art results on both retrieval and question-answering benchmarks, significantly outperforming traditional RAG pipelines and long-context models.
- End-to-End Memory Model: We propose EverMemModel, a unified architecture that seamlessly integrates retrieval and generation, moving beyond the limitations of decoupled RAG systems.
- State-of-the-Art Performance: EverMemModel achieves SOTA results on both the retrieval benchmark (NQ320K) and the question-answering tasks (MS MARCO and TriviaQA).
- Massive-Scale Context: Thanks to its efficient architecture, EverMemModel is one of the first models capable of handling contexts up to 100M tokens.
EverMemModel sets a new state of the art on the retrieval task. The best result is in bold.
| Method | NQ320K (Full text) R@1 |
|---|---|
| *Sparse retrieval* | |
| BM25 (Robertson & Zaragoza, 2009b) | 29.7 |
| DocT5Query (Nogueira et al., 2019) | 38.0 |
| *Dense retrieval* | |
| DPR (Karpukhin et al., 2020b) | 50.2 |
| ANCE (Xiong et al., 2021) | 50.2 |
| GTR-Base (Ni et al., 2021) | 56.0 |
| Sentence-T5 (Ni et al., 2022) | 53.6 |
| HCE-J (Chen et al., 2025) | 71.2 |
| Qwen3-Embedding-0.6B (Zhang et al., 2025) | 54.0 |
| Qwen3-Embedding-4B (Zhang et al., 2025) | 62.6 |
| *Generative retrieval* | |
| DSI-QG (Zhuang et al., 2022) | 63.1 |
| NCI (Wang et al., 2022) | 66.4 |
| GenRet (Sun et al., 2023) | 68.1 |
| Self Retrieval (Tang et al., 2024) | 73.3 |
| Ours (EverMemModel) | **75.5** |
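For reference, the R@1 (Recall@1) numbers above measure the percentage of queries whose gold document is ranked first. A minimal sketch of the metric (function names and the toy data are illustrative, not part of this repo):

```python
def recall_at_k(ranked_ids, gold_id, k=1):
    # 1 if the gold document appears among the top-k retrieved ids, else 0
    return int(gold_id in ranked_ids[:k])

def mean_recall_at_k(all_rankings, all_gold, k=1):
    # Average the per-query hits and report as a percentage
    hits = [recall_at_k(r, g, k) for r, g in zip(all_rankings, all_gold)]
    return 100.0 * sum(hits) / len(hits)

# Toy example with two queries: the first gold doc is ranked 1st, the second is ranked 2nd
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
gold = ["d3", "d9"]
print(mean_recall_at_k(rankings, gold, k=1))  # → 50.0
```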
EverMemModel significantly outperforms both strong RAG baselines and large-context models.
| Dataset | Docs | Qwen3RAG-QA (R@1) | Qwen3RAG-QA (R@5) | Qwen3RAG-QA (R@10) | Gemini-2.5-Flash | EverMemModel (Ours) |
|---|---|---|---|---|---|---|
| MS MARCO (0.8M tokens) | 8,389 | 2.235 | 2.535 | 2.548 | 2.710 | 3.812 |
| MS MARCO (7.1M tokens) | 75,574 | 2.225 | 2.521 | 2.759 | N/Aβ | 2.774 |
| TriviaQA (0.87M tokens) | 607 | 3.69 | 4.10 | 4.36 | 3.29 | 4.53 |
| TriviaQA (8.71M tokens) | 5,721 | 3.27 | 3.53 | 3.86 | N/Aβ | 4.22 |
β: Input exceeds the model's maximum context length.
