This repository extends the original FlexGen with several modifications and enhancements, detailed below. It keeps the same API as the original FlexGen.
FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU [paper]
FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.
Requirements:
- PyTorch >= 1.12
git clone https://github.com/PASAUCMerced/FlexGen-Extension.git
cd FlexGen-Extension
pip install -e .
You can run LLaMA-1 with FlexGen by using the following command:
python3 -m flexgen.flex_llama --model "huggyllama/llama-30b"
(1) Execution behaves like OPT: just replace flex_opt with flex_llama, and set the model to "huggyllama/llama-x" (where x is the model size, e.g., 7b, 13b, 30b, or 65b).
(2) All other options are the same as in FlexGen, including how you set the offloading percentages, compression, and cpu-cache-compute.
To get started, you can try a small model like OPT-1.3B first. It fits into a single GPU so no offloading is required. FlexGen will automatically download weights from Hugging Face.
python3 -m flexgen.flex_opt --model facebook/opt-1.3b
You should see some text generated by OPT-1.3B and the benchmark results.
To run large models like OPT-30B, you will need to use CPU offloading. You can try the commands below.
The --percent argument specifies the offloading strategy for parameters, the attention cache, and hidden states separately. The exact meaning of this argument is documented in the FlexGen repository.
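As a rough aid for reading the commands below, the sketch labels the six values of --percent (a hypothetical helper, not part of FlexGen; the field order follows the FlexGen documentation: weights on GPU/CPU, attention cache on GPU/CPU, activations on GPU/CPU, with the remainder of each pair going to disk):

```python
# Hypothetical helper: label the six integers passed to FlexGen's --percent.
# Order (per the FlexGen documentation): weight GPU%, weight CPU%,
# attention-cache GPU%, attention-cache CPU%, activation GPU%,
# activation CPU%. Whatever is left over in each pair is placed on disk.
def describe_percent(percents):
    names = ["weight_gpu", "weight_cpu",
             "cache_gpu", "cache_cpu",
             "activation_gpu", "activation_cpu"]
    if len(percents) != 6:
        raise ValueError("--percent takes exactly six integers")
    return dict(zip(names, percents))

# Example: "--percent 0 100 100 0 100 0" keeps weights on CPU,
# and the attention cache and activations on GPU.
print(describe_percent([0, 100, 100, 0, 100, 0]))
```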
python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
To run OPT-175B, you need to download the weights from metaseq and convert them into Alpa format. You can then try offloading all weights to disk with
python3 -m flexgen.flex_opt --model facebook/opt-175b --percent 0 0 100 0 100 0 --offload-dir YOUR_SSD_FOLDER
FlexGen can be integrated into HELM, a language model benchmark framework, as its execution backend. You can use the commands below to run a Massive Multitask Language Understanding (MMLU) scenario with a single T4 (16GB) GPU and 200GB of DRAM.
pip install crfm-helm
python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100
Note that only a subset of HELM scenarios is tested. See more tested scenarios here.
You can run the examples from the paper 'Can Foundation Models Wrangle Your Data?' by following the instructions here.
If you have multiple machines with GPUs, FlexGen can combine offloading with pipeline parallelism to scale beyond a single machine. For example, if you have 2 GPUs but their aggregate memory is smaller than the model size, you still need offloading; FlexGen lets you run pipeline parallelism across these 2 GPUs to accelerate generation. To see scaled performance, however, the GPUs should be on separate machines. See examples here.
We demonstrate the usage of FlexGen API in completion.py. This example shows how to run generation for two sentences. To get the best throughput out of FlexGen, you typically need to batch more sentences.
FlexGen has a generation API following the style of Hugging Face's transformers.
output_ids = model.generate(
    input_ids,             # batch of tokenized input prompts
    do_sample=True,        # sample instead of greedy decoding
    temperature=0.7,
    max_new_tokens=32,
    stop=stop)             # stop condition for generation
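Prompts must be tokenized into input_ids before calling generate, and prompts in a batch can differ in length. A minimal left-padding sketch is shown below (a hypothetical illustration, not FlexGen's actual implementation; completion.py relies on the tokenizer's own padding, and pad_id here is an assumed padding token id):

```python
# Hypothetical sketch: left-pad a batch of token-id lists to equal length
# so they can be stacked into one batch for model.generate.
# pad_id is an assumed padding token id for illustration only.
def left_pad(batch, pad_id=1):
    max_len = max(len(ids) for ids in batch)
    return [[pad_id] * (max_len - len(ids)) + ids for ids in batch]

padded = left_pad([[5, 6, 7], [8, 9]])
print(padded)  # [[5, 6, 7], [1, 8, 9]]
```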
You can use the example commands below. If you do not have enough GPU/CPU memory, see the Handle Out-Of-Memory section.
# Complete with OPT-6.7B. You need at least 15GB of GPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-6.7b
# Complete with OPT-30B. You need about 90GB of CPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0
# Complete with instruction-tuned OPT-IML-MAX-30B. You need about 90GB of CPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0
We will release an automatic policy optimizer later; for now, you have to manually try a few strategies.
The idea of high-throughput generation is to offload parameters and the attention cache to the CPU as much as possible, and to disk if necessary.
You can see the reference strategies in our benchmark here.
To avoid out-of-memory errors, you can tune the --percent argument to offload more tensors to the CPU and disk.
If you do not have enough GPU/CPU memory, here are a few things you can try. They save more memory but run slower.
- Do not pin weights by adding --pin-weight 0. This can reduce the weight memory usage on CPU by around 20% or more.
- Enable weight compression by adding --compress-weight. This can reduce the weight memory usage by around 70%.
- Offload all weights to disk by using --percent 0 0 100 0 100 0. This requires very little CPU and GPU memory.
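As a back-of-the-envelope check of why compression helps (rough arithmetic only, assuming fp16 weights at 2 bytes per parameter and the ~70% reduction quoted above; actual figures vary with model and overhead):

```python
# Rough, assumption-laden estimate of weight memory for a model.
# Assumes fp16 storage (2 bytes per parameter) and an approximate 70%
# reduction when --compress-weight is enabled, as stated in the text.
def weight_gb(num_params, compress=False):
    gb = num_params * 2 / 1e9            # fp16: 2 bytes per parameter
    return gb * 0.3 if compress else gb  # keep ~30% after compression

# OPT-30B: roughly 60 GB uncompressed, roughly 18 GB compressed.
print(round(weight_gb(30e9)), round(weight_gb(30e9, compress=True)))
```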
@article{wu2024lm,
title={LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control},
author={Wu, Jianbo and Ren, Jie and Yang, Shuangyan and Parasyris, Konstantinos and Georgakoudis, Giorgis and Laguna, Ignacio and Li, Dong},
year={2024}
}
@article{liu2024exploring,
title={Exploring and Evaluating Real-world CXL: Use Cases and System Adoption},
author={Liu, Jie and Wang, Xi and Wu, Jianbo and Yang, Shuangyan and Ren, Jie and Shankar, Bhanu and Li, Dong},
journal={arXiv preprint arXiv:2405.14209},
year={2024}
}
@inproceedings{sheng2023flexgen,
title={Flexgen: High-throughput generative inference of large language models with a single gpu},
author={Sheng, Ying and Zheng, Lianmin and Yuan, Binhang and Li, Zhuohan and Ryabinin, Max and Chen, Beidi and Liang, Percy and R{\'e}, Christopher and Stoica, Ion and Zhang, Ce},
booktitle={International Conference on Machine Learning},
pages={31094--31116},
year={2023},
organization={PMLR}
}