QLLM is an out-of-the-box quantization toolbox for large language models. It is not limited to a specific model family and is designed to quantize any LLM automatically, layer by layer. It can also export the quantized model to ONNX with a single argument, --export_onnx ./onnx_model, and run inference with onnxruntime.
Besides, models quantized by different methods (GPTQ/AWQ) can be loaded from huggingface/transformers and converted to each other without extra effort.
We already support
- GPTQ quantization
- AWQ quantization
Features:
- GPTQ supports all LLM models in huggingface/transformers; the model type is detected automatically and quantized layer by layer.
- For GPTQ, models can be quantized to 2-8 bits, and different layers can be quantized with different bit widths.
- For AWQ, only the models covered by llm-awq/AutoAWQ are supported for now.
- Models quantized by AutoGPTQ and AutoAWQ can be loaded directly.
- Only the NVIDIA GPU platform is supported for now.
- AMD GPU support is under consideration.
pip install git+https://github.com/wejoncy/QLLM.git
- torch: tested on v2.0.0+cu117
- transformers: tested on v4.28.0.dev0
- datasets: tested on v2.10.1
- safetensors: tested on v0.3.0
- onnxruntime: tested on v1.16.1
- onnx
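To check that the installed packages match (or exceed) the tested versions above, a quick sketch like the following can be used; it simply prints the version of each dependency listed in this README:

# Print the installed version of each dependency listed above
import torch, transformers, datasets, safetensors, onnxruntime, onnx

for mod in (torch, transformers, datasets, safetensors, onnxruntime, onnx):
    print(f"{mod.__name__}: {mod.__version__}")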
# Quantize and Save compressed model
CUDA_VISIBLE_DEVICES=0 python -m qllm.run --model=meta-llama/Llama-2-7b-hf --method=gptq --save ./Llama-2-7b-4bit
Use --export_onnx ./onnx_model to export and save the ONNX model:
python -m qllm.run --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --save ./Llama-2-7b-chat-hf_gptq_q4/ --export_onnx ./Llama-2-7b-chat-hf_gptq_q4_onnx/
CUDA_VISIBLE_DEVICES=0 python -m qllm.run --load ./Llama-2-7b-4bit --eval
import numpy as np
import onnxruntime
from transformers import AutoConfig, AutoTokenizer

onnx_path_str = './Llama-2-7b-4bit-onnx'
# The tokenizer and model config are saved next to the exported ONNX model
tokenizer = AutoTokenizer.from_pretrained(onnx_path_str, use_fast=True)
config = AutoConfig.from_pretrained(onnx_path_str)
sample_inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")

onnx_model_path = onnx_path_str + '/model_one_for_all.onnx'
session = onnxruntime.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

input_ids = sample_inputs.input_ids.cpu().numpy()
mask = (np.ones(input_ids.shape, dtype=np.int64)
        if sample_inputs.attention_mask is None
        else sample_inputs.attention_mask.cpu().numpy())
num_layers = config.num_hidden_layers

inputs = {'input_ids': input_ids,
          'attention_mask': mask,
          'use_cache_branch': np.array([0], dtype=np.bool_)}
# Feed dummy past key/value tensors for the first (no-cache) run
for i in range(num_layers):
    inputs[f'present_key.{i}'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
    inputs[f'present_values.{i}'] = np.zeros((1, 32, 32, 128), dtype=np.float16)
outputs = session.run(None, inputs)
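The first output of the session holds the logits. A minimal follow-up step, assuming outputs[0] has the usual (batch, seq_len, vocab_size) layout of causal-LM exports, would pick the next token greedily:

# Greedy next-token step (assumed logits layout: batch, seq_len, vocab_size)
logits = outputs[0]
next_token_id = int(np.argmax(logits[0, -1]))
print(tokenizer.decode([next_token_id]))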
CUDA_VISIBLE_DEVICES=0 python -m qllm.run --load TheBloke/Llama-2-7B-Chat-AWQ --eval
CUDA_VISIBLE_DEVICES=0 python -m qllm.run --load TheBloke/Llama-2-7B-Chat-GPTQ --eval
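Since checkpoints quantized by either method can be loaded and converted to each other (see above), re-saving a downloaded checkpoint should be possible by combining the --load and --save arguments shown in this README. The command below is only a sketch of that workflow, not an officially documented recipe:

CUDA_VISIBLE_DEVICES=0 python -m qllm.run --load TheBloke/Llama-2-7B-Chat-AWQ --save ./Llama-2-7b-chat-resaved/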
Use --use_plugin to enable a chatbot plugin:
python -m qllm.run --model meta-llama/Llama-2-7b-chat-hf --method=awq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_awq_q4/
or
python -m qllm.run --model meta-llama/Llama-2-7b-chat-hf --method=gptq --dataset=pileval --nsamples=16 --use_plugin --save ./Llama-2-7b-chat-hf_gptq_q4/
This code is based on GPTQ.
The Triton GPTQ kernel code is based on GPTQ-triton.
Thanks to AutoGPTQ.
Thanks to llm-awq and AutoAWQ for releasing the AWQ quantization method.