There are multiple levels of quantization. One of the simplest methods is to store the model weights using a reduced number of bits and convert them back to bf16 during inference; the activations are still computed in bf16.
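A minimal sketch of this weight-only approach, assuming simple per-output-channel absmax scaling (the helper names are illustrative, not from any particular library):

import torch

def quantize_weight_int8(w):
    # Per-output-channel absmax scaling: int8_value * scale ~= original weight.
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    w_int8 = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return w_int8, scale

def linear_weight_only(x_bf16, w_int8, scale, bias=None):
    # Dequantize back to bf16 at inference time; the matmul itself runs in bf16.
    w_bf16 = w_int8.to(torch.bfloat16) * scale.to(torch.bfloat16)
    return torch.nn.functional.linear(x_bf16, w_bf16, bias)

w_int8, scale = quantize_weight_int8(torch.randn(4096, 4096))
y = linear_weight_only(torch.randn(1, 4096, dtype=torch.bfloat16), w_int8, scale)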
Since most GPUs support int8 matrix multiplication, we can also quantize the activations to int8 and perform the matrix multiplication in int8 itself. This reduces the memory footprint and increases the speed of the model. But it is not straightforward: int8 matrix multiplication may require custom CUDA kernels, the activation quantization may overflow, and quantizing the activations requires calibrating the quantization parameters using multiple samples.
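A hedged sketch of the calibration step using simulated ("fake") int8 quantization, which runs without custom kernels; a real int8 pipeline would replace the bf16 matmul with an int8 GEMM:

import torch

def calibrate_scale(activations):
    # activations: a list of tensors collected from calibration samples.
    absmax = max(a.abs().max().item() for a in activations)
    return max(absmax / 127.0, 1e-8)  # symmetric int8 scale

def fake_quant_int8(x, scale):
    # Quantize to int8 and immediately dequantize; values outside +-127*scale
    # are clipped, which is where the overflow error shows up.
    return torch.round(x / scale).clamp(-127, 127) * scale

calib = [torch.randn(8, 4096) for _ in range(16)]   # calibration samples
s_x = calibrate_scale(calib)
x_q = fake_quant_int8(torch.randn(1, 4096), s_x)    # quantized-then-dequantized activations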
FP8 is only supported on H100 GPUs, but storing approximations in fp8 can be more accurate than vanilla int8 quantization. The recent QLoRA paper explores different data types, 4-bit Float and 4-bit NormalFloat, which again are only used for storage and not for computation.
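For the 4-bit NormalFloat storage from QLoRA, the bitsandbytes integration in transformers is the quickest thing to try; a sketch, with a placeholder model name:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Weights are stored in 4-bit NF4 and dequantized to bf16 for the actual compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model
    quantization_config=bnb_config,
    device_map="auto",
)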
- Intro to weight quantization: https://freedium.cfd/https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Ftowardsdatascience.com%2Fintroduction-to-weight-quantization-2494701b9c0c
- Holy grail: https://timdettmers.com/2023/01/30/which-gpu-for-deep-learning/
- GPT Fast (read for a good quantization implementation): https://github.com/pytorch-labs/gpt-fast
- Simple notebook: https://colab.research.google.com/drive/1oDfcLRz2AIgsclkXJHj-5wMvbylr4Nxz#scrollTo=iCsoFvwLrgdu
- k-bit scaling laws, basically says that 4-bit is best, even better than 8-bit: https://arxiv.org/pdf/2212.09720.pdf#page=6.11
- GGUF: mainly block quantization for CPU-only use: https://kaitchup.substack.com/p/gguf-quantization-for-fast-and-memory
- AWQ (activation-aware weight quantization): uses the distribution of activations to decide how to scale and quantize the weights.
- GPTQ: https://arxiv.org/pdf/2210.17323.pdf
  - Uses 4-bit quantization with 16-bit computation; the difference from GGUF is that it uses a different quantization method.
  - Explanation video: https://www.youtube.com/watch?v=05v2MA3CXKo
- SmoothQuant+, 4-bit quantization: https://arxiv.org/pdf/2312.03788.pdf
- 6-bit quantization: https://arxiv.org/pdf/2310.05079.pdf
- QLLM, recent SoTA 4-bit: https://arxiv.org/pdf/2310.08041.pdf
- OmniQuant, recent SoTA method; both weight and activation quantization: https://github.com/OpenGVLab/OmniQuant?tab=readme-ov-file
- Comparison of quantization methods:
- Old quant method (BRECQ): https://github.com/yhhhli/BRECQ
- Start with simple quantization for storage and see the impact on inference latency and quality. (Since the model is small, there are no outliers.)
- Use bitsandbytes (bnb) to check int8 and int4 quantization.
- INT8 quantization, since RTX 20-series and newer GPUs support it. But it may require custom CUDA kernels.
  - llm.int8() paper: https://arxiv.org/pdf/2208.07339.pdf
  - Explanation of 8-bit: https://huggingface.co/blog/hf-bitsandbytes-integration
- Check the activation distributions (see the hook sketch after this list).
- Check activation-aware quantization. (Maybe too much work.)
- GPUs support int8 matrix multiplication, which is what llm.int8() by Tim Dettmers uses.
- Only H100 GPUs have native fp8 support, which is fast but of limited use since it is restricted to H100.
- Using torch autocast/AMP is a simple and good way to convert between dtypes; better techniques may exist, but this also works well.
- https://pytorch.org/blog/accelerating-generative-ai-3/
- https://pytorch.org/blog/accelerating-generative-ai-2/
- Compile with torch.compile and mode="max-autotune".
- Compute QKV in one go (see the fused-projection sketch after this list).
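A small sketch for the activation-distribution check mentioned in the list above: forward hooks recording the abs-max of every Linear input (restricting to Linear layers is just an example):

import torch

def register_absmax_hooks(model, stats):
    # Record the running abs-max of each Linear layer's input; large outliers
    # here are what make int8 activation quantization hard.
    handles = []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            def hook(mod, inputs, output, name=name):
                stats[name] = max(stats.get(name, 0.0), inputs[0].detach().abs().max().item())
            handles.append(module.register_forward_hook(hook))
    return handles

stats = {}
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))
handles = register_absmax_hooks(model, stats)
with torch.no_grad():
    model(torch.randn(4, 16))
for h in handles:
    h.remove()
print(stats)  # e.g. {'0': 2.9, '2': 1.7}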
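And a sketch of the fused QKV projection plus torch.compile with max-autotune from the last two bullets (dimensions are illustrative):

import torch

class FusedQKV(torch.nn.Module):
    # One (dim, 3*dim) projection instead of three separate (dim, dim) matmuls.
    def __init__(self, dim):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim, bias=False)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return q, k, v

attn_proj = FusedQKV(1024)
# Autotune the generated kernels; most useful once input shapes are static.
attn_proj = torch.compile(attn_proj, mode="max-autotune")
q, k, v = attn_proj(torch.randn(2, 128, 1024))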
During the iterative reverse diffusion process, we call step() on the scheduler each time after the denoiser predicts the less noisy latent embeddings. Inside step(), the sigmas variable is indexed. If the sigmas array is placed on the GPU, this indexing causes a communication sync between the CPU and GPU. This adds latency, and it becomes more evident once the denoiser has already been compiled.
But if the sigmas array always stays on the CPU (refer to this line), this sync doesn't take place and latency improves. In general, any CPU <-> GPU communication sync should be avoided or kept to a bare minimum, as it can impact inference latency.
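A hedged illustration of the pattern, assuming a CUDA device is available (not the actual scheduler code): pulling a scalar out of a GPU-resident table forces a sync, while a CPU-resident table does not:

import torch

num_steps = 50
sigmas_gpu = torch.linspace(1.0, 0.0, num_steps, device="cuda")
sigmas_cpu = sigmas_gpu.cpu()

sigma = sigmas_gpu[10].item()  # blocks until the GPU catches up (CPU <-> GPU sync)
sigma = sigmas_cpu[10].item()  # value is already host-side, no sync needed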
- https://github.com/Xiuyu-Li/q-diffusion/tree/master
- https://www.youtube.com/watch?v=virARwF_pt4&t=1669s
- SD3 paper: https://arxiv.org/pdf/2403.03206.pdf
- Float8 in Pytorch: https://dev-discuss.pytorch.org/t/float8-in-pytorch-1-x/1815
- 4-bit QLoRA: https://huggingface.co/blog/4bit-transformers-bitsandbytes
- https://github.com/IST-DASLab/marlin
- https://github.com/TimDettmers/bitsandbytes
- https://github.com/turboderp/exllama/tree/master/exllama_ext/cuda_func
- 4/8 bit in diffusers: huggingface/diffusers#6500
- fp8 storage: AUTOMATIC1111/stable-diffusion-webui#14031
- 4bit Qlinear: huggingface/optimum-quanto#65
- QX4: ggerganov/llama.cpp#1240
- Quantized linear layer: https://discuss.pytorch.org/t/understanding-quantized-linear-layer/154000
- GPTQ & bnb benchmarking by TheBloke: AutoGPTQ/AutoGPTQ#49 (comment)
The Qualcomm whitepaper shows that the hardware implementation of the FP8 format is somewhere between 50% and 180% less efficient than INT8 in terms of chip area and energy usage, because of the additional logic needed in the accumulation of FP formats versus integer formats. This seems like a broad range, but the actual efficiency depends on many hardware design choices that vary greatly. A similar conclusion was reached recently by Microsoft and Meta: floating-point arithmetic is just much less efficient than integer arithmetic.
This means that FP8 will have to be significantly more accurate than INT8 to be worthwhile from a hardware-efficiency perspective.
Biases are not converted because, to preserve the accuracy of a typical addmm operation, they must be converted with a scale equal to the product of the input and weight scales, which leads to a ridiculously small scale and, conversely, requires a very high bitwidth to avoid clipping.
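A quick numeric illustration of why that bias scale gets so small: the bias has to share the int32 accumulator's scale, which is the product of the input and weight scales (the values below are made up):

# y_int32 = x_int8 @ w_int8 + b; the bias must use scale s_x * s_w.
s_x = 0.02        # activation scale
s_w = 0.0005      # weight scale
s_b = s_x * s_w   # 1e-05 -> an int8 bias could only represent +-127 * 1e-05 ~= 0.0013,
                  # so a high-bitwidth (e.g. int32) or unquantized bias is kept instead.
print(s_b)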
https://pytorch.org/docs/stable/amp.html#torch.autocast
CUDA ops that can autocast to float16: __matmul__, addbmm, addmm, addmv, addr, baddbmm, bmm, chain_matmul, multi_dot, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, GRUCell, linear, LSTMCell, matmul, mm, mv, prelu, RNNCell
CUDA ops that can autocast to float32: __pow__, __rdiv__, __rpow__, __rtruediv__, acos, asin, binary_cross_entropy_with_logits, cosh, cosine_embedding_loss, cdist, cosine_similarity, cross_entropy, cumprod, cumsum, dist, erfinv, exp, expm1, group_norm, hinge_embedding_loss, kl_div, l1_loss, layer_norm, log, log_softmax, log10, log1p, log2, margin_ranking_loss, mse_loss, multilabel_margin_loss, multi_margin_loss, nll_loss, norm, normalize, pdist, poisson_nll_loss, pow, prod, reciprocal, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softmax, softmin, softplus, sum, renorm, tan, triplet_margin_loss
CUDA ops that promote to the widest input type: these ops don't require a particular dtype for stability, but take multiple inputs and require that the inputs' dtypes match. If all of the inputs are float16, the op runs in float16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32. The ops: addcdiv, addcmul, atan2, bilinear, cross, dot, grid_sample, index_put, scatter_add, tensordot
Some ops not listed here (e.g., binary ops like add) natively promote inputs without autocasting’s intervention. If inputs are a mixture of float16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.
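A minimal autocast example matching the rules above, assuming a CUDA device (the linear runs in float16 while the softmax is autocast to float32):

import torch

linear = torch.nn.Linear(512, 512).cuda()
x = torch.randn(8, 512, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    h = linear(x)                 # on the float16 list above
    p = torch.softmax(h, dim=-1)  # on the float32 list, runs in float32
print(h.dtype, p.dtype)           # torch.float16 torch.float32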
Add the following to profile CUDA kernels in PyTorch
import torch

torch.cuda.cudart().cudaProfilerStart()   # start the Nsight Systems capture range
torch.cuda.nvtx.range_push("inference")   # label this region in the profiler timeline
# PyTorch code using CUDA kernels (for example, model inference)
torch.cuda.nvtx.range_pop()
torch.cuda.cudart().cudaProfilerStop()    # stop the capture range
Run the following command to profile the CUDA kernels
nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu --capture-range=cudaProfilerApi --cudabacktrace=true -x true -o bf16_true_disabled_doubleq_dis2 python tests/test_linear_modules.py
Sources: