We propose Rodimus*, comprising Rodimus and Rodimus+, which aims to break the accuracy-efficiency trade-off of vanilla Transformers by introducing several innovative features.
Rodimus:
- Linear attention-based, purely recurrent model.
- Incorporates Data-Dependent Tempered Selection (DDTS) for semantic compression.
- Reduced memory usage.
Rodimus+:
- Hybrid model combining Rodimus with Sliding Window Shared-Key Attention (SW-SKA).
- Enhances semantic, token, and head compression.
Rodimus+-Coder:
- We train and open-source the lightweight Rodimus+-Coder models, available in 1.6B and 4B sizes, achieving performance that surpasses SOTA models of similar size.
- Constant memory footprint with better language-modeling performance.
- Better scaling performance than the Transformer.
- A genuinely lightweight model, free of the O(T) KV-cache memory complexity (see the sketch below).
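To make the memory claim concrete, here is a hypothetical back-of-the-envelope comparison of a Transformer KV cache against a fixed-size recurrent state (all dimensions below are assumed values for illustration, not Rodimus's actual configuration):

```python
# Back-of-the-envelope memory comparison (illustrative only; all sizes are
# assumed values, not the actual Rodimus configuration).
BYTES = 2            # fp16
LAYERS, HEADS, HEAD_DIM = 24, 16, 128
STATE_DIM = 64       # assumed per-head recurrent state expansion

def kv_cache_bytes(seq_len: int) -> int:
    # A Transformer KV cache grows linearly with sequence length T: O(T).
    return 2 * LAYERS * HEADS * HEAD_DIM * seq_len * BYTES  # keys + values

def recurrent_state_bytes() -> int:
    # A purely recurrent model keeps a fixed-size state, independent of T.
    return LAYERS * HEADS * HEAD_DIM * STATE_DIM * BYTES

for T in (2_048, 32_768):
    print(f"T={T}: KV cache {kv_cache_bytes(T) / 2**20:.0f} MiB "
          f"vs fixed state {recurrent_state_bytes() / 2**20:.0f} MiB")
```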
These checkpoints completed training before the paper was submitted; they can be used to reproduce the benchmarks reported in the paper.
If you want a more practical model, we strongly recommend downloading the checkpoints under Rodimus+-Coder.
Model (2024/10/01) | #Total Params | Training Tokens | Context Length | Download |
---|---|---|---|---|
Rodimus-1.4B-Base | 1.4B | 500B | 2K | 🤗 HuggingFace 🤖 ModelScope |
Rodimus+-1.6B-Base | 1.6B | 1T | 2K | 🤗 HuggingFace 🤖 ModelScope |
Rodimus+-Coder-1.6B-Base-20241001 | 1.6B | 2.5T | 4K | 🤗 HuggingFace 🤖 ModelScope |
Rodimus+-Coder-1.6B-Base-20241001 is the model enhanced by the multi-stage training on math and code datasets described in the paper.
Refer to the following table to choose the model that fits your use case. If you are located in mainland China, we also provide the models on modelscope.cn to speed up the download process.
Model | #Total Params | Training Tokens | Context Length | Download |
---|---|---|---|---|
Rodimus+-Coder-1.6B-Base | 1.6B | 8.2T | 4K | 🤗 HuggingFace 🤖 ModelScope |
Rodimus+-Coder-1.6B-Chat | 1.6B | - | 4K | 🤗 HuggingFace 🤖 ModelScope |
Rodimus+-Coder-4B-Base | 4B | 8.2T | 4K | 🤗 HuggingFace 🤖 ModelScope |
Rodimus+-Coder-4B-Chat | 4B | - | 4K | 🤗 HuggingFace 🤖 ModelScope |
We re-evaluated the Qwen-series models ourselves; metrics for the other model series are quoted from their original papers. For the detailed evaluation code, please refer to the evaluation method of Ling-Coder-Lite in CodeFuse-Evaluation.
Datasets | Qwen2.5-Coder-1.5B | Rodimus+-Coder-1.6B-Base | Gemma2-2B-PT | Qwen2.5-Coder-3B | Rodimus+-Coder-4B-Base | Gemma3-4B-PT | Qwen2.5-Coder-7B |
---|---|---|---|---|---|---|---|
Coding Tasks | |||||||
HumanEval | 41.5 | 51.2 | 19.5 | 51.8 | 60.4 | 36.0 | 60.4 |
HumanEval+ | 34.8 | 45.1 | - | 40.9 | 52.4 | - | 50.6 |
MBPP | 57.2 | 51.2 | 31.0 | 62.6 | 64.6 | 46.0 | 70.0 |
MBPP+ | 66.1 | 62.2 | - | 65.9 | 71.4 | - | 70.1 |
BCB-Completion | 21.6 | 17.9 | - | 26.2 | 30.8 | - | 30.4 |
MultiPL-E | 46.1 | 52.5 | - | 49.4 | 60.7 | - | 56.9 |
CRUXEval | 38.5 | 45.1 | - | 44.6 | 56.4 | - | 56.8 |
Coding Avg. | 43.7 | 46.5 | - | 48.8 | 56.7 | - | 56.4 |
General Tasks | |||||||
C-EVAL | 55.2 | 56.7 | - | 65.3 | 70.2 | - | 69.1 |
CMMLU | 54.5 | 52.3 | - | 65.4 | 68.3 | - | 72.7 |
MMLU | 55.5 | 51.1 | 52.2 | 63.3 | 62.6 | 59.6 | 70.5 |
BBH | 21.8 | 46.8 | 42.4 | 32.5 | 61.9 | 50.9 | 67.3 |
General Avg. | 46.8 | 51.7 | - | 56.6 | 65.8 | - | 69.9 |
Mathematics Tasks | |||||||
GSM8K | 60.4 | 68.7 | 25.0 | 72.1 | 78.5 | 38.4 | 83.4 |
MATH | 23.7 | 29.0 | 16.4 | 31.9 | 37.0 | 24.2 | 42.2 |
Math Avg. | 41.9 | 48.9 | 20.7 | 52.0 | 57.8 | 31.3 | 62.8 |
Overall | |||||||
Overall | 44.4 | 48.4 | - | 51.7 | 59.6 | - | 61.6 |
Datasets | Qwen2.5-Coder-1.5B-Instruct | Rodimus+-Coder-1.6B-Chat | Gemma2-2B-IT | Qwen2.5-Coder-3B-Instruct | Phi-4-Mini-3.8B | Rodimus+-Coder-4B-Chat | Gemma3-4B-IT | Qwen2.5-Coder-7B-Instruct |
---|---|---|---|---|---|---|---|---|
Coding Tasks | ||||||||
HumanEval | 64.6 | 76.8 | 20.1 | 79.9 | 74.4 | 86.6 | 71.3 | 87.2 |
HumanEval+ | 63.4 | 73.8 | - | 80.5 | 68.3 | 82.9 | - | 82.3 |
MBPP | 51.0 | 59.0 | 36.6 | 59.2 | 65.3 | 68.0 | 63.2 | 75.8 |
MBPP+ | 53.0 | 66.4 | - | 61.9 | 63.8 | 68.5 | - | 75.1 |
LCB(24.08-24.11) | 4.0 | 10.9 | - | 13.0 | - | 13.9 | - | 22.8 |
BCB-Instruct | 10.8 | 21.5 | - | 21.7 | 33.8 | 26.6 | - | 30.6 |
HumanEval-Mul | 50.8 | 57.3 | - | 67.4 | - | 70.6 | - | 76.1 |
MBPP-Mul | 43.4 | 52.4 | - | 53.4 | - | 59.6 | - | 61.4 |
MBXP-EN | 55.8 | 75.5 | - | 76.0 | - | 87.3 | - | 87.7 |
MBXP-CN | 48.8 | 75.0 | - | 68.7 | - | 84.3 | - | 83.5 |
CRUXEval | 28.6 | 55.0 | - | 51.6 | - | 63.2 | - | 69.3 |
HumanEvalFix | 38.9 | 52.6 | - | 55.5 | - | 68.8 | - | 69.3 |
Spider | 61.2 | 71.4 | - | 71.8 | 42.2 | 73.5 | - | 82.0 |
Coding Avg. | 44.2 | 57.5 | - | 58.5 | - | 65.7 | - | 69.5 |
General Tasks | ||||||||
C-EVAL | 51.5 | 50.8 | - | 62.0 | - | 61.6 | - | 66.4 |
CMMLU | 45.2 | 50.5 | - | 60.1 | - | 62.0 | - | 64.9 |
MMLU | 52.0 | 49.3 | 56.1 | 61.7 | 67.3 | 57.5 | 58.1 | 66.1 |
BBH | 24.2 | 58.7 | 41.4 | 57.3 | 70.4 | 63.7 | 72.2 | 59.1 |
General Avg. | 43.2 | 52.3 | - | 60.3 | - | 61.2 | - | 64.1 |
Mathematics Tasks | ||||||||
GSM8K | 54.4 | 68.5 | 62.6 | 73.5 | 88.6 | 79.2 | 89.2 | 79.5 |
MATH | 38.1 | 33.5 | 27.2 | 44.1 | 64.0 | 44.1 | 75.6 | 60.8 |
Math Avg. | 46.2 | 51.0 | 44.9 | 58.8 | 68.8 | 61.7 | 82.4 | 70.1 |
Overall | ||||||||
Overall | 44.2 | 55.8 | - | 58.9 | - | 64.3 | - | 68.4 |
- The latest version of `transformers` is recommended (at least 4.42.0).
- We evaluate our models with `python=3.8` and `torch==2.1.2`.
- If you use Rodimus, you need to install `flash-linear-attention`, `causal_conv1d`, and `triton>=2.2.0`. If you use Rodimus+, you additionally need to install `flash-attention`.
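To confirm the environment is set up, a quick version check can be run first (a minimal sketch; the PyPI distribution names, e.g. `causal-conv1d` and `flash-attn`, are assumptions and may differ from the import names):

```python
# Minimal environment check; the PyPI distribution names below are
# assumptions and may differ from the package import names.
import importlib.metadata as md

for pkg in ("transformers", "torch", "triton",
            "flash-linear-attention", "causal-conv1d", "flash-attn"):
    try:
        print(f"{pkg}: {md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```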
In `examples/generation_script.py`, we provide a code snippet showing how to use the model for generation:
```python
import torch

from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# Load the model and tokenizer from a local checkpoint directory.
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()

# Inference.
input_prompt = "你好!你是谁?"  # "Hello! Who are you?"
model_inputs = tokenizer(input_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=32)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
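For sampling instead of the default greedy decoding, the standard `transformers` generation arguments apply (a minimal sketch reusing `model`, `tokenizer`, and `model_inputs` from above; the hyperparameter values are arbitrary examples):

```python
# Sampling-based generation; the hyperparameter values are arbitrary examples.
outputs = model.generate(
    **model_inputs,
    max_new_tokens=64,
    do_sample=True,    # stochastic sampling instead of greedy search
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```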
In `examples/chat_script.py`, we further show how to chat with Rodimus+:
```python
import torch

from modeling_rodimus import RodimusForCausalLM
from tokenization_rodimus_fast import RodimusTokenizer

# Load the model and tokenizer from a local checkpoint directory.
ckpt_dir = "model_path"
tokenizer = RodimusTokenizer.from_pretrained(ckpt_dir)
model = RodimusForCausalLM.from_pretrained(
    ckpt_dir,
    torch_dtype=torch.float16,
    device_map="cuda",
).eval()

# Inference.
input_prompt = "简单介绍一下大型语言模型。"  # "Briefly introduce large language models."
messages = [
    {"role": "HUMAN", "content": input_prompt}
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    system='You are Rodimus$+$, created by AntGroup. You are a helpful assistant.',
    tokenize=False,
)
print(text)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=2048)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
```
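To continue a multi-turn conversation, append the model's reply and the next user turn to `messages`, then re-apply the chat template (a minimal sketch reusing the objects above; the `"ASSISTANT"` role name is our assumption, so check the tokenizer's chat template for the exact role keys):

```python
# Multi-turn chat sketch, reusing `model`, `tokenizer`, `messages`, and
# `response` from the example above. The "ASSISTANT" role name is an
# assumption; inspect the tokenizer's chat template for the exact role key.
messages.append({"role": "ASSISTANT", "content": response})
messages.append({"role": "HUMAN", "content": "What are some typical applications?"})

text = tokenizer.apply_chat_template(
    messages,
    system='You are Rodimus$+$, created by AntGroup. You are a helpful assistant.',
    tokenize=False,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**model_inputs, max_length=2048)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```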
If you find our work helpful, please consider citing us:
```bibtex
@inproceedings{he2025rodimus,
  title={Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions},
  author={Zhihao He and Hang Yu and Zi Gong and Shizhan Liu and Jianguo Li and Weiyao Lin},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=IIVYiJ1ggK}
}
```