| SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models <br> Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai "Helen" Li, Yiran Chen | ![image](/aharshms/Awesome-Efficient-LLM/raw/main/figures/SiDA.png) | Paper |
| ![Star](https://camo.githubusercontent.com/6c4418eed01d138d8ef89b305640dc4a82e1b23a4198d3869cf4cb67b938b006/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f64766d617a75722f6d69787472616c2d6f66666c6f6164696e672e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Fast Inference of Mixture-of-Experts Language Models with Offloading <br> Artyom Eliseev, Denis Mazur | ![image](/aharshms/Awesome-Efficient-LLM/raw/main/figures/mixtral_offloading.png) | Github Paper |
| ![Star](https://camo.githubusercontent.com/829a4c59c69febd09067e39faa162e6ec466a71ec26747d80c73947fbb08b93f/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f726f6265727463736f726461732f6d6f655f617474656e74696f6e2e7376673f7374796c653d736f6369616c266c6162656c3d53746172) SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention <br> Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber | ![image](/aharshms/Awesome-Efficient-LLM/raw/main/figures/switchhead.png) | Github Paper |
| ![Star](https://camo.githubusercontent.com/2193b2def071b0d1fd992892da16ff18a57493ee2ba1a37c81f508d80f64aecd/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f594a484d49545745422f4578466c6f772e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference <br> Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda | ![image](/aharshms/Awesome-Efficient-LLM/raw/main/figures/exflow.png) | Github Paper |
| ![Star](https://camo.githubusercontent.com/4f784ab04599339053fc982bb0b2e0917e9f5a65a23c030316bd8471e665c4c5/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f546f7263684d6f452f4d6f452d496e66696e6974792e7376673f7374796c653d736f6369616c266c6162656c3d53746172) MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving <br> Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina | ![image](/aharshms/Awesome-Efficient-LLM/raw/main/figures/MOE-Infinity.png) | Github Paper |
| ![Star](https://camo.githubusercontent.com/cfc25d2a4940ffe0d3a40c94e867b46dd72ebc13d5dcae809e2f7f74c034a0ab/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f656665736c61622f666964646c65722e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models <br> Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci | ![image](https://github.com/efeslab/fiddler/raw/main/asset/key-idea.png) | Github Paper |
| ![Star](https://camo.githubusercontent.com/7b60af42b78877eba64e38b21127b7f8c97c06ac106e065e22e8ed31bdbd2765/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f4c75636b792d4c616e63652f4578706572745f53706172736974792e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models <br> Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li | ![image](https://camo.githubusercontent.com/27dc31c8c074136dac0df3de7e8d9399545f05998fa30d7fd1500ce2f82b4c33/68747470733a2f2f61727869762e6f72672f68746d6c2f323430322e313438303076312f78322e706e67) | Github Paper |
| Enhancing Efficiency in Sparse Models with Sparser Selection <br> Yuanhang Yang, Shiyi Qi, Wenchao Gu, Chaozheng Wang, Cuiyun Gao, Zenglin Xu | ![image](https://camo.githubusercontent.com/87a340f2a641c86466f8b35d6caa6105faa3fa1d505128953a33bf46844f845b/68747470733a2f2f61727869762e6f72672f68746d6c2f323430332e313839323676312f78332e706e67) | Github Paper |
| ![Star](https://camo.githubusercontent.com/c089aa9d3efcdf8c69a519c62ee0a123293fe20acb96c4055109ff285c11531e/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f68646f6e673932302f4752494646494e2e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Prompt-prompted Mixture of Experts for Efficient LLM Generation <br> Harry Dong, Beidi Chen, Yuejie Chi | ![image](https://camo.githubusercontent.com/9e1e630fb9c673a88479a5b107aa542e447079effc8d36cfaaf6460dbc5646bf/68747470733a2f2f61727869762e6f72672f68746d6c2f323430342e303133363576312f6578747261637465642f353530393236332f666967757265732f616c676f726974686d2e706e67) | Github Paper |
| Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts <br> Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, Jiayi Huang | ![image](https://camo.githubusercontent.com/cfaf37d884a66041757ba30634b475140f5cc154cef3496d7a705a74f11e3c2c/68747470733a2f2f61727869762e6f72672f68746d6c2f323430342e303530313976312f78312e706e67) | Paper |
| SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts <br> Alexandre Muzio, Alex Sun, Churan He | ![image](https://camo.githubusercontent.com/89a7b5e494526511530346efc1d02757193d1c012d5bd1d294f555bfbe9a013d/68747470733a2f2f61727869762e6f72672f68746d6c2f323430342e303530383976312f78312e706e67) | Paper |
| Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models <br> Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda | ![image](https://camo.githubusercontent.com/081d402b642b345514d1c3e506b8583ca864a3c7ad5e39f7962ac051c77e297b/68747470733a2f2f61727869762e6f72672f68746d6c2f323430342e303535363776312f78322e706e67) | Paper |
| ![Publish](https://camo.githubusercontent.com/75523d98060122ee08f6e9ac267836460ac9dd74dad301b0044c6767e817f486/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436f6e666572656e63652d4d4c7379732732342d626c7565) Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping <br> Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang | ![image](https://camo.githubusercontent.com/2ab36466e10a6afd1bedeb2e4811b4f2e77a04b8fd9396a1cb4c9762bf9c503b/68747470733a2f2f61727869762e6f72672f68746d6c2f323430342e313934323976312f78342e706e67) | Paper |
| ![Publish](https://camo.githubusercontent.com/f1d3135c8bc1af7eb58aacfb9b5d08584280663156317a53a0c09049c4d672fe/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436f6e666572656e63652d49434d4c2732342d626c7565) A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts <br> Mohammed Nowaz Rabbani Chowdhury, Meng Wang, Kaoutar El Maghraoui, Naigang Wang, Pin-Yu Chen, Christopher Carothers | ![image](https://camo.githubusercontent.com/70e57c6e525d8605191a1bcba4283b4638f464de1505f988f1ca7029db28b712/68747470733a2f2f61727869762e6f72672f68746d6c2f323430352e313636343676322f6578747261637465642f353632363430322f4669672f746f6b656e5f6578706572745f636f6d62696e65645f322e706e67) | Paper |
| ![Star](https://camo.githubusercontent.com/2fbcadf57ea5b829bbda9ec5b05947f17cc5cf937110d28321ef9456f4966dde/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f4c494e732d6c61622f44796e4d6f452e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models <br> Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Tao Lin | ![image](/aharshms/Awesome-Efficient-LLM/raw/main/figures/dynmoe.png) | Github Paper |
| ![Publish](https://camo.githubusercontent.com/e65d27b167b26b7240e2145c329b07d77ec84ff5257cf87f9a99ddef690fa01a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f436f6e666572656e63652d4441432732342d626c7565) MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models <br> Taehyun Kim, Kwanseok Choi, Youngmock Cho, Jaehoon Cho, Hyuk-Jae Lee, Jaewoong Sim | ![image](https://camo.githubusercontent.com/c0af9317782fb69e3be23e385cb444183fd85d83d909d5f50195bb4632709de2/68747470733a2f2f61727869762e6f72672f68746d6c2f323430352e313838333276312f78342e706e67) | Paper |
| ![Star](https://camo.githubusercontent.com/7b3904230fac2a85cf9af7ccaa42d6ce023cb055101fc3dfe1ca2299e4f51d26/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f4461697a65446f6e672f556e69666965642d4d6f452d436f6d7072657373696f6e2e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Demystifying the Compression of Mixture-of-Experts Through a Unified Framework <br> Shwai He, Daize Dong, Liang Ding, Ang Li | ![image](https://camo.githubusercontent.com/c1a2668e271f48bb73a664211e59edc3038bc9ba3048ea1acf1ffce8d833df44/68747470733a2f2f61727869762e6f72672f68746d6c2f323430362e303235303076312f78312e706e67) | Github Paper |
| ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models <br> Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang | ![image](https://camo.githubusercontent.com/9470d32be0f39e788ff6dbb95489baf227ff256d587c74d884e064a92e356bb9/68747470733a2f2f61727869762e6f72672f68746d6c2f323430362e303930343176312f78312e706e67) | Paper |
| ![Star](https://camo.githubusercontent.com/33c9b7c370d2c68de323b6edb7a0dee61045a71c4f05a0cd18be6f84199fb2be/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f554e495445532d4c61622f6d6f652d7175616e74697a6174696f6e2e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark <br> Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen | ![image](https://camo.githubusercontent.com/c2b82be0ca74e1e620c52422f4be6613c4bc6b86614b02f7f73b6842e3324a9b/68747470733a2f2f61727869762e6f72672f68746d6c2f323430362e303831353576312f78312e706e67) | Github Paper |
| ![Star](https://camo.githubusercontent.com/9d82b85d55322206d42858d748a1a6a9b54916e3bf0fd553b66543a9627f81b2/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f696d6167696e6174696f6e2d72657365617263682f4545502e7376673f7374796c653d736f6369616c266c6162656c3d53746172) Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs <br> Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang | ![image](https://camo.githubusercontent.com/e2666f15603833823f5a31a94b97871778dcd1d0fee8a564170823078ba8086c/68747470733a2f2f61727869762e6f72672f68746d6c2f323430372e303039343576312f6578747261637465642f353639373337302f466967757265732f7573655f636173652e706e67) | Github Paper |
| Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts <br> Zeliang Zhang, Xiaodong Liu, Hao Cheng, Chenliang Xu, Jianfeng Gao | ![image](https://camo.githubusercontent.com/d6f97cca7b217b7461781148ce4e74d1ad54421d112f6b02242bc3d2433b4f9a/68747470733a2f2f61727869762e6f72672f68746d6c2f323430372e303935393076312f78332e706e67) | Paper |
| ![Star](https://camo.githubusercontent.com/9d485a5a9710576f8a90139bf998508e444b308040a9219908ff3c32e0e278ac/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f4161726f6e6875616e672d3737382f4d432d4d6f452e7376673f7374796c653d736f6369616c266c6162656c3d53746172) MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More <br> Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi | ![image](https://github.com/Aaronhuang-778/MC-MoE/raw/main/imgs/WX20241009-191322@2x.png) | Github Paper |
| EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference <br> Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai | | Paper |