This is the PyTorch implementation of the Mega paper, **Mega: Moving Average Equipped Gated Attention**, by Xuezhe Ma*, Chunting Zhou*, Xiang Kong, Junxian He, Liangke Gui, Graham Neubig, Jonathan May, and Luke Zettlemoyer. This folder is based on the fairseq package v0.9.0.
This repository requires Python 3.8+ and PyTorch 1.11+.
# Install from this repo
pip install -e .
For faster training, install NVIDIA's apex library, following the fairseq instructions.
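As a quick sanity check of the environment (a minimal sketch that only assumes the requirements stated above), you can verify the Python/PyTorch versions and the editable install from a Python shell:

```python
# Sanity check: verify the stated requirements and the editable fairseq install.
import sys
import torch
import fairseq  # installed by `pip install -e .` in this repo

assert sys.version_info >= (3, 8), "Python 3.8+ is required"
major, minor = (int(v) for v in torch.__version__.split(".")[:2])
assert (major, minor) >= (1, 11), "PyTorch 1.11+ is required"
print("fairseq", fairseq.__version__, "| torch", torch.__version__)
```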
| Task | Description | # params | Download |
|---|---|---|---|
| LRA | Mega on LRA tasks | -- | mega.lra.zip |
| WMT'14 (En-De) | Mega-base on WMT'14 En-De | 67M | mega.wmt14ende.base.zip |
| WMT'14 (De-En) | Mega-base on WMT'14 De-En | 67M | mega.wmt14deen.base.zip |
| SC-Raw | Mega-base/big on raw Speech Commands | 300k | mega.sc.zip |
| WikiText-103 | Language modeling on WikiText-103 | 252M | mega.wiki103.zip |
| Enwiki8 | Language modeling on Enwiki8 | 39M | mega.enwiki8.zip |
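The released archives contain standard fairseq checkpoints and can be inspected with plain `torch.load`. The file name below (`model.pt`) and the dictionary keys are assumptions based on the usual fairseq checkpoint layout; adjust them to the extracted contents of the archive.

```python
# Minimal sketch for inspecting a downloaded checkpoint.
# "model.pt" and the keys below are assumptions based on the usual fairseq
# checkpoint format; adjust to whatever the extracted archive contains.
import torch

state = torch.load("model.pt", map_location="cpu")
print(list(state.keys()))   # typically includes 'args' (or 'cfg') and 'model'
print(len(state["model"]))  # number of tensors in the model state dict
```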
- Long Range Arena
- Machine Translation (coming soon)
- Speech Classification
- Language Modeling
- ImageNet (coming soon)
- The Mega layer is implemented in `fairseq/modules/mega_layer.py` (a simplified sketch of its EMA sub-layer is given after this list).
- The Mega encoder (LRA) is implemented in `fairseq/models/lra/mega_lra_encoder.py`.
- The Mega decoder (LM) is implemented in `fairseq/models/mega_lm.py`.
- The Mega encoder-decoder (NMT) is implemented in `fairseq/models/mega.py`.
- Models are trained with float32 because, at the time of development, fft and rfft did not support fp16 in PyTorch 1.11.0. We will try fp16 with newer PyTorch versions.
- If you'd like to apply Mega to your own task/data, besides the architecture size, the hyperparameters most worth tuning are the learning rate (lr) and weight decay (wd). We find that tuning wd is a more effective regularization for Mega (in contrast to tuning dropout rates for Transformers). Suggested wd values are 0.01, 0.05, and 0.1; larger models typically need larger wd (please refer to the appendix of our paper for the hyperparameters we used). For the lr scheduler, linear and cosine lr decay schedules are more effective for Mega than the inverse square root decay scheduler.
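For orientation before diving into the code, below is a minimal, self-contained sketch of the multi-dimensional damped EMA that Mega places in front of its gated attention. The class name, random parameter initialization, and the explicit time loop are illustrative assumptions; the repository's `fairseq/modules/mega_layer.py` uses a different parameterization and an FFT-based computation for efficiency.

```python
# Illustrative (not the repo's) sequential multi-dimensional damped EMA:
#   u_t = beta * x_t
#   h_t = alpha * u_t + (1 - alpha * delta) * h_{t-1}
#   y_t = sum over EMA dimensions of eta * h_t
import torch
import torch.nn as nn


class DampedEMA(nn.Module):
    def __init__(self, embed_dim: int, ndim: int = 2):
        super().__init__()
        # One (alpha, delta) pair and one expansion/projection vector per EMA dimension.
        self.alpha = nn.Parameter(torch.randn(embed_dim, ndim))
        self.delta = nn.Parameter(torch.randn(embed_dim, ndim))
        self.beta = nn.Parameter(torch.randn(embed_dim, ndim))  # input expansion
        self.eta = nn.Parameter(torch.randn(embed_dim, ndim))   # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        alpha = torch.sigmoid(self.alpha)   # keep decay factors in (0, 1)
        delta = torch.sigmoid(self.delta)
        h = x.new_zeros(x.size(0), x.size(2), self.alpha.size(1))  # (batch, dim, ndim)
        out = []
        for t in range(x.size(1)):
            u = x[:, t, :, None] * self.beta            # expand: (batch, dim, ndim)
            h = alpha * u + (1.0 - alpha * delta) * h   # damped EMA recurrence
            out.append((h * self.eta).sum(dim=-1))      # project back: (batch, dim)
        return torch.stack(out, dim=1)                  # (batch, seq_len, dim)


y = DampedEMA(embed_dim=8)(torch.randn(4, 16, 8))
print(y.shape)  # torch.Size([4, 16, 8])
```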
Mega is released under the Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0) license. The license applies to the model checkpoints as well.
@article{ma2022mega,
  title={Mega: Moving Average Equipped Gated Attention},
  author={Ma, Xuezhe and Zhou, Chunting and Kong, Xiang and He, Junxian and Gui, Liangke and Neubig, Graham and May, Jonathan and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2209.10655},
  year={2022}
}