
Gated Delta Networks: Improving Mamba2 with Delta Rule


Official PyTorch implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule (ICLR '25).


Songlin Yang, Jan Kautz and Ali Hatamizadeh.

For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing

For additional functionality, such as variable-length (varlen) training and inference support, see the FLA implementation.

🌟 Why Gated DeltaNet?

Gated DeltaNet introduces a novel approach to linear transformers by combining:

  • 🧠 Smart Memory Management: a gated forgetting mechanism that decides what to keep and what to discard
  • Precise Updates: targeted memory updates via the delta rule (see the sketch below) that improve model efficiency
  • 💻 Hardware Efficiency: an optimized implementation for real-world deployment
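
To make the combination concrete, below is a minimal, non-optimized PyTorch sketch of a gated delta rule recurrence: a data-dependent gate alpha_t decays the matrix-valued memory (as in Mamba2), while the delta rule writes a beta_t-weighted value into the slot addressed by the current key. The function and variable names, shapes, and the sequential loop are illustrative assumptions for exposition, not the repository's hardware-efficient kernels.

import torch

def gated_delta_rule(q, k, v, alpha, beta):
    """Naive recurrent sketch of a gated delta rule layer (illustrative only).

    q, k  : (T, d_k) queries and keys (keys assumed L2-normalized)
    v     : (T, d_v) values
    alpha : (T,) forgetting gate in (0, 1), Mamba2-style decay
    beta  : (T,) writing strength in (0, 1), delta-rule update size
    Returns per-step outputs of shape (T, d_v).
    """
    T, d_k = k.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_v, d_k)                    # matrix-valued memory state
    outs = []
    for t in range(T):
        q_t, k_t, v_t = q[t], k[t], v[t]
        v_old = S @ k_t                          # value currently stored under k_t
        # Gated delta rule: decay the whole memory by alpha_t, then overwrite
        # the slot for k_t with a beta_t-weighted blend of old and new value:
        #   S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
        S = alpha[t] * (S - beta[t] * torch.outer(v_old, k_t)) \
            + beta[t] * torch.outer(v_t, k_t)
        outs.append(S @ q_t)                     # read out with the query
    return torch.stack(outs)

# Hypothetical usage with random inputs
T, d_k, d_v = 8, 16, 32
q = torch.randn(T, d_k)
k = torch.nn.functional.normalize(torch.randn(T, d_k), dim=-1)
v = torch.randn(T, d_v)
out = gated_delta_rule(q, k, v, alpha=torch.rand(T), beta=torch.rand(T))  # (T, d_v)

Note how the two mechanisms relate: setting alpha_t = 1 recovers DeltaNet's pure delta rule, while dropping the delta-rule correction term leaves a Mamba2-style gated outer-product update; the gated delta rule combines both.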

Architecture Overview

Efficiency

Gated DeltaNet achieves strong training throughput compared to models such as Mamba2 and Samba.

Language Modeling and Reasoning

Our model outperforms competing architectures of various types (e.g., Transformer, RNN, and hybrid models) in terms of perplexity and zero-shot accuracy on reasoning benchmarks.

Long-context

Gated DeltaNet also achieves favorable perplexity on long-context benchmarks.

📢 Latest Updates

  • 01/22/2025: 🔥 Gated DeltaNet has been accepted to ICLR '25.
  • 12/09/2024: 🔥 Code release: train your own Gated DeltaNet on the SlimPajama dataset.
  • Watch this space for more exciting updates!

🚀 Getting Started

Training Your Model

Launch your training with our streamlined command:

python ../pretrain.py \
--train_data_dir ${TRAIN_DATA} \
--val_data_dir ${VALIDATION_DATA} \
--output_root ${SAVE_DIR} \
--exp_name ${NAME} \
--model_name ${MODEL} \
--train_config ${CONFIG} \
--eval_iters ${EVAL_ITERS} \
--learning_rate ${LR} \
--micro_batch_size ${MICRO_BATCH_SIZE}

💡 Pro Tip: Add --interactive_job --debug for interactive debugging sessions!
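
For example, a filled-in invocation for a quick interactive run could look like the following. The paths, training config, and hyperparameter values are hypothetical placeholders (only the GatedDeltaNet_H1 model name comes from the run described below); use the names defined in this repository's configs and settings appropriate for your hardware.

# NOTE: every value below is an illustrative placeholder, not a recommended setting.
TRAIN_DATA=/path/to/slimpajama/train
VALIDATION_DATA=/path/to/slimpajama/val
SAVE_DIR=./out
NAME=gated_deltanet_debug
MODEL=GatedDeltaNet_H1        # hypothetical string; use a model name defined in this repo's configs
CONFIG=debug_config           # hypothetical string; use a train config defined in this repo's configs
python ../pretrain.py \
--train_data_dir ${TRAIN_DATA} \
--val_data_dir ${VALIDATION_DATA} \
--output_root ${SAVE_DIR} \
--exp_name ${NAME} \
--model_name ${MODEL} \
--train_config ${CONFIG} \
--eval_iters 20 \
--learning_rate 4e-4 \
--micro_batch_size 1 \
--interactive_job --debug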

Please see this Slurm script for training the GatedDeltaNet_H1 model with 0.4B parameters on 15B tokens. The run requires 4 nodes and finishes in approximately 4 hours. For this run, the validation loss and perplexity curves (evaluated at 1x and 2x the training context length, to test length extrapolation) are expected to look as follows:

[Figure: validation loss and perplexity curves at 1x and 2x evaluation lengths]

📜 License

Copyright © 2024, NVIDIA Corporation. All rights reserved.

Licensed under the NVIDIA Source Code License-NC. See LICENSE for details.

🙏 Acknowledgements

Built on the shoulders of giants.

⭐ Support Us

If you find this work useful, please consider:

  • Starring the repository
  • Citing our paper
  • Contributing to the codebase

Join us in pushing the boundaries of linear transformers! 🚀

Citation

If you find Gated DeltaNet to be useful for your work, please consider citing our paper:

@inproceedings{yang2025gated,
  title     = {Gated Delta Networks: Improving Mamba2 with Delta Rule},
  author    = {Songlin Yang and Jan Kautz and Ali Hatamizadeh},
  booktitle = {The Thirteenth International Conference on Learning Representations},
  year      = {2025},
  url       = {https://openreview.net/forum?id=r8H7xhYPwz}
}

Star History

[Star history chart for NVlabs/GatedDeltaNet]
