
FlexGen-Extension

We made some modifications and enhancements based on the original FlexGen; the details are below. We keep the same API as the original FlexGen.

FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU [paper]

FlexGen is a high-throughput generation engine for running large language models with limited GPU memory. FlexGen allows high-throughput generation by IO-efficient offloading, compression, and large effective batch sizes.

Installation

Requirements:

Method 1: From source

git clone https://github.com/PASAUCMerced/FlexGen-Extension.git
cd FlexGen-Extension
pip install -e .
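
A quick way to confirm the editable install worked is to import the package from Python. This is only a sanity check; the two modules below are the entry points invoked by the commands later in this README.

# Sanity check: these modules are the entry points used by the commands below
# (python3 -m flexgen.flex_opt / flexgen.flex_llama). Importing them verifies
# that the editable install is visible on your Python path.
import flexgen.flex_opt
import flexgen.flex_llama

print("FlexGen-Extension is importable")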

Usage and Examples

LLaMA-1 Support

You can run LLaMA-1 with FlexGen by using the following command:

python3 -m flexgen.flex_llama --model "huggyllama/llama-30b"

(1) The execution behaves like OPT: just replace flex_opt with flex_llama and set --model to "huggyllama/llama-x", where x is the model size (e.g., 7b, 13b, 30b, or 65b).

(2) The other options are the same as in FlexGen, including how you set the offloading percentages, compression, and cpu-cache-compute.

Get Started with a Single GPU

OPT-1.3B

To get started, you can try a small model like OPT-1.3B first. It fits into a single GPU so no offloading is required. FlexGen will automatically download weights from Hugging Face.

python3 -m flexgen.flex_opt --model facebook/opt-1.3b

You should see some text generated by OPT-1.3B and the benchmark results.

OPT-30B

To run large models like OPT-30B, you will need to use CPU offloading. You can try the commands below. The --percent argument specifies the offloading strategy for parameters, attention cache, and hidden states separately. The exact meaning of this argument can be found here.

python3 -m flexgen.flex_opt --model facebook/opt-30b --percent 0 100 100 0 100 0
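
For reference, the six numbers of --percent follow the original FlexGen convention: weights, attention (KV) cache, and hidden states, each split between GPU and CPU, with whatever is left over placed on disk. The annotated sketch below spells out what the OPT-30B command above requests; it is illustrative only.

# Illustrative only: how "--percent 0 100 100 0 100 0" above is interpreted,
# following the original FlexGen convention for this flag.
opt_30b_offloading = {
    "weights_on_gpu": 0,        # 1st number: % of model weights on GPU
    "weights_on_cpu": 100,      # 2nd number: % of model weights on CPU (remainder goes to disk)
    "cache_on_gpu": 100,        # 3rd number: % of attention (KV) cache on GPU
    "cache_on_cpu": 0,          # 4th number: % of attention cache on CPU (remainder goes to disk)
    "activations_on_gpu": 100,  # 5th number: % of hidden states on GPU
    "activations_on_cpu": 0,    # 6th number: % of hidden states on CPU (remainder goes to disk)
}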

OPT-175B

To run OPT-175B, you need to download the weights from metaseq and convert them into Alpa format. You can then try offloading all weights to disk by

python3 -m flexgen.flex_opt --model facebook/opt-175b --percent 0 0 100 0 100 0 --offload-dir YOUR_SSD_FOLDER

Run HELM Benchmark with FlexGen

FlexGen can be integrated into HELM, a language model benchmark framework, as its execution backend. You can use the commands below to run a Massive Multitask Language Understanding (MMLU) scenario with a single T4 (16GB) GPU and 200GB of DRAM.

pip install crfm-helm
python3 -m flexgen.apps.helm_run --description mmlu:model=text,subject=abstract_algebra,data_augmentation=canonical --pad-to-seq-len 512 --model facebook/opt-30b --percent 20 80 0 100 0 100 --gpu-batch-size 48 --num-gpu-batches 3 --max-eval-instance 100

Note that only a subset of HELM scenarios is tested. See more tested scenarios here.

Run Data Wrangling Tasks with FlexGen

You can run the examples in this paper, 'Can Foundation Models Wrangle Your Data?', by following the instructions here.

Scaling to Distributed GPUs

If you have multiple machines with GPUs, FlexGen can combine offloading with pipeline parallelism to allow scaling. For example, if you have 2 GPUs but their aggregate GPU memory is less than the model size, you still need offloading. FlexGen allows you to use pipeline parallelism across these 2 GPUs to accelerate generation. To get scaled performance, however, the GPUs should be on separate machines. See examples here.

API Example

We demonstrate the usage of FlexGen API in completion.py. This example shows how to run generation for two sentences. To get the best throughput out of FlexGen, you typically need to batch more sentences.

Generation API

FlexGen has a generation API following the style of Hugging Face's transformers.

output_ids = model.generate(
	input_ids,
	do_sample=True,
	temperature=0.7,
	max_new_tokens=32,
	stop=stop)
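
The call above assumes that a model, input_ids, and stop token have already been constructed. Below is a condensed sketch of that setup, modeled on flexgen/apps/completion.py; the Policy fields, import paths, and cleanup call are taken from that file, but treat them as assumptions and consult completion.py in this repository for the authoritative version.

# A condensed setup sketch modeled on flexgen/apps/completion.py.
from transformers import AutoTokenizer
from flexgen.flex_opt import Policy, OptLM, ExecutionEnv, CompressionConfig

prompts = [
    "Question: Where were the 2004 Olympics held?\nAnswer:",
    "Question: What is the longest river on the earth?\nAnswer:",
]

# Keep everything on the GPU for a small model; lower the GPU percentages
# (see --percent above) to enable offloading for larger models.
env = ExecutionEnv.create("~/flexgen_offload_dir")
policy = Policy(len(prompts), 1,         # gpu_batch_size, num_gpu_batches
                100, 0, 100, 0, 100, 0,  # weights / cache / hidden states: all on GPU
                overlap=True, sep_layer=True, pin_weight=True,
                cpu_cache_compute=False, attn_sparsity=1.0,
                compress_weight=False,
                comp_weight_config=CompressionConfig(num_bits=4, group_size=64,
                                                     group_dim=0, symmetric=False),
                compress_cache=False,
                comp_cache_config=CompressionConfig(num_bits=4, group_size=64,
                                                    group_dim=2, symmetric=False))

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", padding_side="left")
tokenizer.add_bos_token = False
stop = tokenizer("\n").input_ids[0]

model = OptLM("facebook/opt-1.3b", env, "~/opt_weights", policy)
inputs = tokenizer(prompts, padding="max_length", max_length=128)

output_ids = model.generate(inputs.input_ids,
                            do_sample=True,
                            temperature=0.7,
                            max_new_tokens=32,
                            stop=stop)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True))

env.close_copy_threads()  # shut down the offloading I/O threads before exiting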

Example Commands

You can use the example commands below. If you do not have enough GPU/CPU memory, see the Handle Out-Of-Memory section.

# Complete with OPT-6.7B. You need at least 15GB of GPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-6.7b
# Complete with OPT-30B. You need about 90GB of CPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-30b --percent 0 100 100 0 100 0
# Complete with instruction-tuned OPT-IML-MAX-30B. You need about 90GB of CPU memory.
python3 -m flexgen.apps.completion --model facebook/opt-iml-max-30b --percent 0 100 100 0 100 0

Frequently Asked Questions

How to set the offloading strategy and --percent?

We will release an automatic policy optimizer later, but for now you have to try a few strategies manually. The idea of high-throughput generation is to offload parameters and the attention cache to the CPU, and to disk if necessary, as much as possible. You can see the reference strategies in our benchmark here. To avoid out-of-memory errors, you can tune --percent to offload more tensors to the CPU and disk.

How to handle out-of-memory?

If you do not have enough GPU/CPU memory, here are a few things you can try. They save more memory but run slower. A sketch after this list shows how the same options map onto the Python API.

  • Do not pin weights by adding --pin-weight 0. This can reduce the weight memory usage on CPU by around 20% or more.
  • Enable weight compression by adding --compress-weight. This can reduce the weight memory usage by around 70%.
  • Offload all weights to disk by using --percent 0 0 100 0 100 0. This requires very little CPU and GPU memory.
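
If you drive FlexGen through the Python API instead of the command line, the same memory-saving choices correspond to fields of the Policy object used in the API example above. The sketch below shows that mapping; the field names are an assumption based on the original FlexGen Policy dataclass, so verify them against flexgen/flex_opt.py.

# Rough mapping of the memory-saving CLI flags above onto Policy fields
# (assumed to match the original FlexGen Policy dataclass).
from flexgen.flex_opt import Policy, CompressionConfig

low_memory_policy = Policy(
    1, 1,                    # gpu_batch_size, num_gpu_batches
    0, 0, 100, 0, 100, 0,    # --percent 0 0 100 0 100 0: all weights go to disk
    overlap=True, sep_layer=True,
    pin_weight=False,        # --pin-weight 0: do not pin weight memory on CPU
    cpu_cache_compute=False, attn_sparsity=1.0,
    compress_weight=True,    # --compress-weight: 4-bit group-wise quantization
    comp_weight_config=CompressionConfig(num_bits=4, group_size=64,
                                         group_dim=0, symmetric=False),
    compress_cache=False,
    comp_cache_config=CompressionConfig(num_bits=4, group_size=64,
                                        group_dim=2, symmetric=False))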

Citation

@article{wu2024lm,
  title={LM-Offload: Performance Model-Guided Generative Inference of Large Language Models with Parallelism Control},
  author={Wu, Jianbo and Ren, Jie and Yang, Shuangyan and Parasyris, Konstantinos and Georgakoudis, Giorgis and Laguna, Ignacio and Li, Dong},
  year={2024}
}
@article{liu2024exploring,
  title={Exploring and Evaluating Real-world CXL: Use Cases and System Adoption},
  author={Liu, Jie and Wang, Xi and Wu, Jianbo and Yang, Shuangyan and Ren, Jie and Shankar, Bhanu and Li, Dong},
  journal={arXiv preprint arXiv:2405.14209},
  year={2024}
}
@inproceedings{sheng2023flexgen,
  title={Flexgen: High-throughput generative inference of large language models with a single gpu},
  author={Sheng, Ying and Zheng, Lianmin and Yuan, Binhang and Li, Zhuohan and Ryabinin, Max and Chen, Beidi and Liang, Percy and R{\'e}, Christopher and Stoica, Ion and Zhang, Ce},
  booktitle={International Conference on Machine Learning},
  pages={31094--31116},
  year={2023},
  organization={PMLR}
}
