QMoE

This repository contains the full code of the paper QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models.

It is organized as follows:

  • datautils.py: utilities for dataset loading
  • gptq.py: robust batched implementation of GPTQ
  • quant.py: quantization utilities
  • sub1.py: efficient inference of compressed models
  • sub1_cuda_kernel.cu: CUDA kernels
  • switch.py: the efficient QMoE compression framework
  • test.py: per-layer benchmarks and ideal compression rates

Dependencies

The project was developed with:

  • torch==2.0.0+cu117
  • transformers==4.28.0
  • datasets==2.10.1
  • CUDA 11.4 GPU drivers

CUDA kernels for compressed storage and inference can be installed via:

python setup_cuda.py install
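For reference, such a build script is typically based on PyTorch's C++/CUDA extension tooling. The following is only an illustrative sketch (the module name and build options are assumptions; the actual setup_cuda.py in this repository may differ):

# Illustrative sketch of a torch CUDA-extension build script; not necessarily the repository's actual setup_cuda.py.
# Assumes the kernel source file sub1_cuda_kernel.cu and a CUDA toolkit matching the installed torch build.
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name='sub1_cuda',  # assumed package name
    ext_modules=[
        CUDAExtension('sub1_cuda', ['sub1_cuda_kernel.cu'])  # compiles the CUDA kernels into a Python module
    ],
    cmdclass={'build_ext': BuildExtension},
)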

Usage

Below is a list of sample commands for running the different experiments.

# BF16 baseline eval on C4 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 
# BF16 baseline eval on additional datasets 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --detaileval
# ternary round to nearest baseline 
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --nearest 

# ternary compression with QMoE, saving the compressed model for later inference
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 1.5 --trainsamples 10000 --save PATH_TO_COMP_MODEL
# 2-bit compression with QMoE
CUDA_VISIBLE_DEVICES=0 python switch.py google/switch-base-128 --wbits 2 --trainsamples 10000

# test kernels and compute ideal compression rates 
CUDA_VISIBLE_DEVICES=0 python test.py
# run per-layer benchmarks
CUDA_VISIBLE_DEVICES=0 python test.py --benchmark

# run eval of stored compressed model
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --valsamples 128 
# run end-to-end benchmark
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128
# run simulated end-to-end benchmark for BF16
CUDA_VISIBLE_DEVICES=0 python sub1.py PATH_TO_COMP_MODEL --gentokens 128 --simul

In general, you can pass google/switch-large-128 and google/switch-c-2048 to run on large-128 and c-2048, respectively. Note that SwitchTransformer models other than these three may not work out of the box due to Hugging Face bugs.

Always specify CUDA_VISIBLE_DEVICES since some commands, like sub1.py, will otherwise attempt to use all available GPUs.

Compressed Models

Our models in compressed custom QMoE format are available on Hugging Face: base-128, large-128 and c-2048. To use them, clone the repository and then simply pass their path to sub1.py.
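As a sketch, the checkpoints can also be fetched programmatically with the huggingface_hub library; the repository id below is a placeholder, not the actual model name:

# Hedged sketch: download a compressed QMoE checkpoint from the Hugging Face Hub.
# "ORG/qmoe-model" is a placeholder repo id; substitute the actual repository of base-128, large-128 or c-2048.
from huggingface_hub import snapshot_download

model_path = snapshot_download(repo_id="ORG/qmoe-model")
print(model_path)  # pass this path to sub1.py, e.g. python sub1.py <model_path> --valsamples 128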

Cite

If you found this work useful, please consider citing:

@article{frantar-qmoe,
  title={{QMoE}: Practical Sub-1-Bit Compression of Trillion-Parameter Models},
  author={Elias Frantar and Dan Alistarh},
  year={2023},
  journal={arXiv preprint arXiv:2310.16795}
}
