qllm

🌩 Lightweight model quantization tool

The motivation behind this tool is to reduce model size and memory usage for inference. Storing large models can be burdensome, especially on edge devices, and a model with a smaller footprint often simplifies and streamlines the inference process.

The qllm tool lets developers quantize any Hugging Face model from the HF Hub with little loss in performance or accuracy. The underlying method is simple linear quantization (both asymmetric and symmetric), enabling quick, ready-to-go quantization on the fly.

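For intuition, linear quantization maps floating-point weights to low-bit integers using a scale factor (and, in the asymmetric variant, a zero point). The snippet below is a minimal PyTorch sketch of symmetric int8 quantization; it illustrates the general idea rather than qllm's exact implementation.

import torch

def symmetric_int8_quantize(w: torch.Tensor):
    # Per-tensor scale so the largest magnitude maps to 127.
    scale = w.abs().max() / 127
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Approximate reconstruction of the original weights.
    return q.to(torch.float32) * scale
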
Note: This tool is not ready for production environments and should only be used for building intuition. Consider it a tool for model quantization experimentation.

Getting started

Download and load any model from the Hugging Face Hub. This example uses the facebook/opt-350m model.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch 

model_id = "facebook/opt-350m"
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
    low_cpu_mem_usage=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Original footprint of the model: 662.392832 MB

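The footprint figures quoted in the comments can be reproduced by summing the sizes of the model's parameters and buffers. A minimal helper (not part of qllm; the function name is illustrative) is sketched below.

def model_footprint_mb(model) -> float:
    # Total bytes held by parameters and buffers, reported in megabytes.
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / 1e6

print(f"Original footprint: {model_footprint_mb(model):.2f} MB")
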
Run and test the original model.

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("What are we having for dinner?"))

# [{'generated_text': "What are we having for dinner?\nI'm having a steak and a salad.\nI'm"}]

Replace the model's linear layers with qllm's quantized linear layers and run quantization.

from qllm import LinearQuantizer
from qllm.layers import W8A16LL

LinearQuantizer.replace_and_quantize_modules(model, W8A16LL, ['lm_head'])
# Model footprint after quantization: 359.799808 MB

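Conceptually, a W8A16 linear layer stores int8 weights together with a per-output-channel scale and dequantizes them to the activation dtype (e.g. bfloat16) at matmul time. The class below is a naive illustration of that idea in plain PyTorch, not qllm's actual W8A16LL implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveW8A16Linear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data
        # Per-output-channel symmetric scale so each row fits the int8 range.
        scale = (w.abs().amax(dim=1, keepdim=True) / 127).clamp(min=1e-8)
        self.register_buffer("scale", scale)
        self.register_buffer(
            "int8_weight",
            torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8),
        )
        self.bias = linear.bias

    def forward(self, x):
        # Dequantize weights on the fly and apply a regular linear transform.
        w = self.int8_weight.to(x.dtype) * self.scale.to(x.dtype)
        return F.linear(x, w, self.bias)
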
Test the model after quantization.

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("What are we having for dinner?"))

# [{'generated_text': "What are we having for dinner?\nI'm having a steak dinner.\nI'm having a"}]

Observation

That is roughly a 46% reduction in model footprint ((662.39 − 359.80) / 662.39 ≈ 0.46). The exact savings will vary from model to model, but a reduction of around 30% or more is already significant from an inference point of view.
