This repository contains implementations of mixture-of-experts (MoE) models such as the Switch Transformer (Fedus et al., 2021). It explores how conditional computation can be used to scale model parameter count independently of per-token compute, and how this scaling affects model quality and training time.
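As a minimal sketch of the core idea (not this repository's actual code), the snippet below illustrates Switch-style top-1 routing in PyTorch: each token is dispatched to exactly one expert, so adding experts grows parameter count while per-token compute stays roughly constant. The class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchMoE(nn.Module):
    """Sketch of Switch-style top-1 routing: each token goes to one expert."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)  # routing probabilities
        gate, expert_idx = probs.max(dim=-1)       # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # Scale by the gate value so the router receives gradients.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    moe = SwitchMoE(d_model=16, d_ff=32, num_experts=4)
    tokens = torch.randn(8, 16)
    print(moe(tokens).shape)  # torch.Size([8, 16])
```

Note this sketch omits pieces a production implementation needs, such as the load-balancing auxiliary loss and expert capacity limits described in the Switch Transformer paper.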
For more details, please see ROADMAP.md.
For questions, please contact the authors of this repository directly.