QLLM is an out-of-the-box quantization toolbox for large language models. It is not limited to a specific model family and is designed to quantize any LLM automatically, layer by layer. It can also export the quantized model to ONNX with a single argument, --export_onnx ./onnx_model, and run inference with onnxruntime.
Besides, models quantized by different methods (GPTQ/AWQ) can be loaded from huggingface/transformers and converted to each other without extra effort.
We already support
- GPTQ quantization
- AWQ quantization
Features:
- GPTQ supports all LLM models in huggingface/transformers; the model type is detected automatically and quantized layer by layer.
- For GPTQ, models can be quantized to 2-8 bits, and different layers can be quantized with different bit widths.
- For AWQ, only the models covered by llm-awq/AutoAWQ are supported for now.
- Models quantized by AutoGPTQ and AutoAWQ can be loaded directly.
- Only the NVIDIA GPU platform is supported for now.
- AMD GPU support is under consideration.
pip install git+https://github.com/wejoncy/QLLM.git
- torch: tested on v2.0.0+cu117
- transformers: tested on v4.28.0.dev0
- datasets: tested on v2.10.1
- safetensors: tested on v0.3.0
- onnxruntime: tested on v1.16.1
- onnx
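To check that the installed packages match (or exceed) the tested versions above, a quick sketch like the following can be used; it simply prints the version of each dependency listed in this README:

# Print the installed version of each dependency listed above
import torch, transformers, datasets, safetensors, onnxruntime, onnx

for mod in (torch, transformers, datasets, safetensors, onnxruntime, onnx):
    print(f"{mod.__name__}: {mod.__version__}")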
# Quantize and Save compressed model
CUDA_VISIBLE_DEVICES=0 python -m qllm.run --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit
Use --export_onnx ./onnx_model to export and save the ONNX model:
python -m qllm.run --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --save ./Llama-2-7b-chat-hf_gptq_q4/ --export_onnx ./Llama-2-7b-chat-hf_gptq_q4_onnx/
CUDA_VISIBLE_DEVICES=0 python -m qllm.run --load ./Llama-2-7b-4bit --eval
import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'
# The tokenizer and model config are saved next to the exported ONNX model
tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
config = AutoConfig.from_pretrained(onnx_path_str)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

onnx_model_path = onnx_path_str + '/model_one_for_all.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

input_ids = sample_inputs.input_ids.cpu().numpy()
mask = (np.ones(input_ids.shape, dtype=np.int64)
        if sample_inputs.attention_mask is None
        else sample_inputs.attention_mask.cpu().numpy())
num_layers = config.num_hidden_layers

inputs = {'input_ids': input_ids,
          'attention_mask': mask,
          'use_cache_branch': np.array([0], dtype=np.bool_)}
# Feed dummy past key/value tensors for the first (no-cache) run
for i in range(num_layers):
    inputs[f'present_key.{i}'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'present_values.{i}'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
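The first output of the session holds the logits. A minimal follow-up step, assuming outputs[0] has the usual (batch, seq_len, vocab_size) layout of causal-LM exports, would pick the next token greedily:

# Greedy next-token step (assumed logits layout: batch, seq_len, vocab_size)
logits = outputs[0]
next_token_id = int(np.argmax(logits[0, -1]))
print(tokenizer.decode([next_token_id]))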
CUDA_VISIBLE_DEVICES=0 python -m qllm.run --load TheBloke/Llama-2-7B-Chat-AWQ --eval
CUDA_VISIBLE_DEVICES=0 python -m qllm.run --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
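Since checkpoints quantized by either method can be loaded and converted to each other (see above), re-saving a downloaded checkpoint should be possible by combining the --load and --save arguments shown in this README. The command below is only a sketch of that workflow, not an officially documented recipe:

CUDA_VISIBLE_DEVICES=0 python -m qllm.run --load TheBloke/Llama-2-7B-Chat-AWQ --save ./Llama-2-7b-chat-resaved/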
Use --use_plugin to enable a chatbot plugin:
python -m qllm.run --model meta-llama/Llama-2-7b-chat-hf --method=awq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/
or
python -m qllm.run --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/
This code is based on GPTQ.
The Triton GPTQ kernel code is based on GPTQ-triton.
Thanks to AutoGPTQ.
Thanks to llm-awq and AutoAWQ for releasing the AWQ quantization method.