Major update
jinjungyu committed Jan 29, 2024
1 parent fc58c0e commit 27e8523
Showing 18 changed files with 75 additions and 1,849 deletions.
7 changes: 6 additions & 1 deletion .gitignore
@@ -6,4 +6,9 @@ __pycache__
*.pth
test.py
test.ipynb
experiment
analysis
output
rebuttal/
*quant_cuda_kernel_*
demo*
201 changes: 0 additions & 201 deletions LICENSE

This file was deleted.

49 changes: 28 additions & 21 deletions README.md
@@ -1,16 +1,23 @@
# OWQ: Lessons learned from activation outliers for weight quantization in large language models
# [AAAI 2024 (Oral)]   OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models

This is the code for the paper [OWQ: Lessons learned from activation outliers for weight quantization in large language models](https://arxiv.org/abs/2306.02272). OWQ preserves few weak columns as FP16, while quantizing other weights to 3/4-bits. OWQ achieves substantial quality improvements with only negligible storage and computation overhead, effectively preserving the benefits of low-precision acceleration.
<p align="center">
<img src="./images/owq_llama.png" width="300px" height="300px">
</p>
This is the code for the paper [OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Models](https://arxiv.org/abs/2306.02272). OWQ preserves a few weak columns as FP16, while quantizing the other weights to 3/4 bits. OWQ achieves substantial quality improvements with only negligible storage and computation overhead, effectively preserving the benefits of low-precision acceleration.

<p align="center">
<br>
<img src="./images/owq_figure.png">
</p>
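For intuition, the snippet below is a minimal PyTorch sketch of this idea: score each input column of a weight matrix with a simple sensitivity proxy, keep the top-scoring (weak) columns in FP16, and round-to-nearest quantize the rest. It is illustrative only; the function name and the sensitivity proxy are placeholders, and the actual OWQ reconstruction in `owq/recon.py` selects and compensates the weak columns from calibration data in a more principled way.

```
import torch

def toy_owq_quantize(W, X, wbits=3, n_keep=8):
    """Illustrative sketch only (not owq/recon.py): keep the n_keep most
    sensitive input columns of W in FP16, quantize the rest to wbits bits.
    W: (out_features, in_features) weight; X: (n_samples, in_features) calibration activations."""
    # Per-column sensitivity proxy: activation energy times weight energy.
    sensitivity = (X.float() ** 2).mean(dim=0) * (W.float() ** 2).mean(dim=0)
    keep = torch.topk(sensitivity, n_keep).indices         # weak columns kept in FP16

    mask = torch.ones(W.shape[1], dtype=torch.bool)
    mask[keep] = False                                      # columns to quantize

    Wq = W.clone().float()
    rest = Wq[:, mask]
    qmax = 2 ** wbits - 1
    wmin = rest.min(dim=1, keepdim=True).values
    wmax = rest.max(dim=1, keepdim=True).values
    scale = (wmax - wmin).clamp(min=1e-8) / qmax            # per-row asymmetric min-max
    q = torch.clamp(torch.round((rest - wmin) / scale), 0, qmax)
    Wq[:, mask] = q * scale + wmin                          # store dequantized values
    return Wq.to(W.dtype), keep
```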

## Updates (2024-01-22)
## Updates (2024-01-29)
* Integrated all models (OPT, LLaMA, BLOOM, Falcon) into the `main.py` file. You can easily add custom or openly accessible Hugging Face models to `model_config.json` if you want.
* Support for a 4-bit matrix × FP16 vector product CUDA kernel.
* Support for BFloat16.

## Features
* Implementation of the OWQ algorithm: `owq/recon.py`, `main.py`
* 3/4-bit weight quantization of LLMs (OPT, LLaMA1,2 families and etc..): `main.py`
* 3/4-bit weight quantization of LLMs (OPT, LLaMA-1/2 families, etc.): `main.py`
* Evaluating the perplexity of quantized models: `main.py`
* Evaluating the zero-shot accuracy of quantized models: `zeroshot.py`
* Supports 3/4-bit packed weight save / load (~1/5 and ~1/4 the file size of the FP16 checkpoint, respectively).
@@ -21,7 +28,7 @@ This is the code for the paper [OWQ: Lessons learned from activation outliers fo
* [Install](#install)
* [Usage](#usage)
* [Zero-shot](#zero-shot)
* [3-bit CUDA kernel](#3-bit-cuda-kernels)
* [3/4-bit CUDA kernels](#34-bit-cuda-kernels)

## Install
We highly recommend using a docker image that supports CUDA. If you use anaconda instead, you need to set up CUDA for kernel use.
@@ -67,51 +74,51 @@ We have tested 3/4-bit CUDA kernel on the NVIDIA A100, A6000 and RTX3090 GPU.

### Running OWQ & measuring the perplexity (PPL)

Here we use OPT-1.3b model as an example. You can replace the model argument `opt-1.3b` among `opt-125m`, `opt-350m`, `opt-2.7b`, `opt-6.7b`, `opt-13b`, `opt-66b` or other models (e.g. `meta-llama/Llama-2-7b-hf`).
Here we use the llama-7b model (`huggyllama/llama-7b`) as an example. You can replace the model argument `llama-7b` with `llama-13b`, `llama-30b`, or `llama-65b`, or with other model families (e.g. `meta-llama/Llama-2-7b-hf`, `facebook/opt-6.7b`, `lmsys/vicuna-33b-v1.3`, etc.).

* OWQ using 3.01-bit (3-bit quantization + few FP16 weight columns)
```
python main.py facebook/opt-1.3b c4 --wbits 3 --target_bit 3.01
python main.py huggyllama/llama-7b c4 --wbits 3 --target_bit 3.01
```
* OWQ using 4.01-bit (4-bit quantization + few FP16 weight columns)
```
python main.py facebook/opt-1.3b c4 --wbits 4 --target_bit 4.01
python main.py huggyllama/llama-7b c4 --wbits 4 --target_bit 4.01
```

Below are examples for the other options (FP16, RTN, GPTQ).
```
# Measuring the ppl of the full precision (FP16) model
python main.py facebook/opt-1.3b c4 --wbits 16
python main.py huggyllama/llama-7b c4 --wbits 16
# 4-bit Round-to-Nearest (RTN) quantization
python main.py facebook/opt-1.3b c4 --wbits 4 --nearest
python main.py huggyllama/llama-7b c4 --wbits 4 --nearest
# GPTQ with 3-bit quantization
python main.py facebook/opt-1.3b c4 --wbits 3 --tuning minmax
python main.py huggyllama/llama-7b c4 --wbits 3 --tuning minmax
```
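For reference, the reported perplexity is the standard chunked causal-LM perplexity over the evaluation set. The sketch below is a simplified stand-in for the evaluation loop in `main.py`; the `perplexity` helper and the `seqlen=2048` default are illustrative assumptions.

```
import torch

@torch.no_grad()
def perplexity(model, input_ids, seqlen=2048):
    """Simplified sketch: split a long token stream into seqlen-sized windows
    and average the causal-LM loss. input_ids has shape (1, n_tokens)."""
    model.eval()
    nlls = []
    n_chunks = input_ids.shape[1] // seqlen
    for i in range(n_chunks):
        chunk = input_ids[:, i * seqlen:(i + 1) * seqlen].to(model.device)
        loss = model(input_ids=chunk, labels=chunk).loss  # HF shifts labels internally
        nlls.append(loss.float() * seqlen)
    return torch.exp(torch.stack(nlls).sum() / (n_chunks * seqlen))
```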

### Zero-shot
Here we give an example of measuring zero-shot accuracy on `lambada_openai` and `piqa` tasks using opt-125m model.
Here we give an example of measuring zero-shot accuracy on the `hellaswag` task using the llama-7b model.
You need to generate a quantized model checkpoint before measuring zero-shot accuracy.
```
# making checkpoint file of OWQ reconstruction
python main.py facebook/opt-125m c4 --wbits 3 --target_bit 3.05 --no-eval --save opt-125m_3_05.pth --packing
python main.py huggyllama/llama-7b c4 --wbits 3 --target_bit 3.01 --no-eval --save llama-7b_3_01.pth --packing
# measuring zero-shot accuracy (single-gpu)
CUDA_VISIBLE_DEVICES=0 python zeroshot.py --model hf-causal-owq --model_args pretrained=facebook/opt-125m,load=opt-125m_3_05.pth --batch_size 4 --tasks lambada_openai --no_cache
# measuring zero-shot accuracy (using single-gpu)
CUDA_VISIBLE_DEVICES=0 python zeroshot.py --model hf-causal-owq --model_args pretrained=huggyllama/llama-7b,load=llama-7b_3_01.pth --batch_size 4 --tasks hellaswag --no_cache
# multi-gpu
CUDA_VISIBLE_DEVICES=0,1 python zeroshot.py --model hf-causal-owq --model_args pretrained=facebook/opt-125m,load=opt-125m_3_05.pth,use_accelerate=True --batch_size 4 --tasks lambada_openai --no_cache
CUDA_VISIBLE_DEVICES=0,1 python zeroshot.py --model hf-causal-owq --model_args pretrained=huggyllama/llama-7b,load=llama-7b_3_01.pth,use_accelerate=True --batch_size 4 --tasks hellaswag --no_cache
```

### Easy OWQ + Measuring PPL, Zeroshot sample
### Easy OPT OWQ + PPL and zero-shot evaluation sample
```
bash scripts/opt_end_to_end_evaluation.sh 0 opt-1.3b
```

## Demo
Please refer to the README in the `demo` directory.

## 3-bit CUDA Kernels
## 3/4-bit CUDA Kernels

### Benchmark kernel performance
@@ -120,9 +127,9 @@
```
cd owq/kernel/
python test_kernel.py
```
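The kernels consume low-bit weights packed into 32-bit words, which is also where the ~1/5 (3-bit) and ~1/4 (4-bit) checkpoint sizes come from (3 or 4 bits per weight instead of 16, plus quantization parameters). The NumPy sketch below shows generic 3-bit packing for illustration only; the actual bit layout produced by `--packing` and expected by the OWQ kernels may differ.

```
import numpy as np

def pack_3bit(codes):
    """Generic illustration: pack 3-bit integer codes (values 0..7) into
    uint32 words, LSB-first. Not necessarily the layout used by the OWQ kernels."""
    codes = np.asarray(codes, dtype=np.uint8)
    assert codes.max() < 8
    bits = np.unpackbits(codes[:, None], axis=1, bitorder="little")[:, :3].reshape(-1)
    bits = np.concatenate([bits, np.zeros((-len(bits)) % 32, dtype=np.uint8)])
    packed = np.packbits(bits.reshape(-1, 32), axis=1, bitorder="little")
    return packed.copy().view(np.uint32).reshape(-1)

# Example: 32 random 3-bit codes -> 96 bits -> 3 uint32 words
print(pack_3bit(np.random.randint(0, 8, size=32)))
```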

### Benchmark language generation with 3/4-bit packed model (opt, llama)
### Benchmark language generation with a 3/4-bit packed model (opt, llama, etc.)
```
# Example of OPT-65b language generation (single token)
# Example of OPT-66b language generation (single token)
# Save compressed model
python main.py facebook/opt-66b c4 --wbits 3 --target_bit 3.01 --no-eval --save opt-66b_3_01.pth --packing
@@ -157,4 +164,4 @@ If you find our code or OWQ useful for your research, please consider citing:
journal={arXiv preprint arXiv:2306.02272},
year={2023}
}
```