Quantization

There are multiple levels of quantization. One of the simplest methods is to store the model weights using a reduced number of bits and convert them back to bf16 during inference. Here the activations are still computed in bf16.
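
A minimal sketch of this weight-only approach (names are illustrative, not taken from any library): the weights are stored as int8 with a per-tensor scale and dequantized back to bf16 right before the matmul, so the activations stay in bf16.

import torch

def quantize_weight_int8(w):
    # Symmetric per-tensor quantization: map [-absmax, absmax] to [-127, 127].
    scale = w.abs().max() / 127.0
    return torch.round(w / scale).clamp(-127, 127).to(torch.int8), scale

def dequantize_to_bf16(w_int8, scale):
    # Convert back to bf16 only at inference time; activations remain bf16.
    return (w_int8.to(torch.float32) * scale).to(torch.bfloat16)

w_int8, scale = quantize_weight_int8(torch.randn(4096, 4096))
x = torch.randn(1, 4096, dtype=torch.bfloat16)
y = x @ dequantize_to_bf16(w_int8, scale).t()   # the matmul itself runs in bf16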

Since most GPUs support int8 matrix multiplication, we can also quantize the activations to int8 and perform the matrix multiplication in int8 itself. This reduces the memory footprint and can increase the speed of the model. But it is not straightforward: int8 matrix multiplication may require custom CUDA kernels, the quantized activations may overflow, and quantizing activations requires calibrating the quantization parameters over multiple samples.
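
A toy, CPU-only illustration of what quantizing both weights and activations involves (real int8 GEMMs need dedicated CUDA kernels, such as the ones llm.int8() ships with). The activation scale x_scale is assumed to come from a calibration pass over a few representative samples:

import torch

def int8_matmul_simulated(x_bf16, w_int8, x_scale, w_scale):
    # Quantize the activations with the pre-calibrated scale; values outside
    # the int8 range saturate, which is where overflow problems show up.
    x_int8 = torch.round(x_bf16.float() / x_scale).clamp(-127, 127).to(torch.int8)
    # Real kernels accumulate in int32 on the GPU; int64 is used here only so
    # the simulation runs on CPU without custom kernels.
    acc = x_int8.to(torch.int64) @ w_int8.to(torch.int64).t()
    # Rescale the integer accumulator back to bf16.
    return (acc.float() * (x_scale * w_scale)).to(torch.bfloat16)

w_int8 = torch.randint(-127, 128, (64, 64), dtype=torch.int8)
y = int8_matmul_simulated(torch.randn(2, 64, dtype=torch.bfloat16), w_int8,
                          x_scale=0.05, w_scale=0.01)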

FP8 is only supported on H100 GPUs, but storing approximations in fp8 can be more accurate than vanilla int8 quantization. The recent QLoRA paper explores additional data types, 4-bit Float and 4-bit NormalFloat, which again are only used for storage and not for computation.
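
For quickly trying NF4 as a storage-only format, one option is the sketch below, assuming the transformers + bitsandbytes stack (the model id is a placeholder; diffusers exposes a similar BitsAndBytesConfig for its own models):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 is used purely to store the weights; compute still happens in bf16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for the matmuls
)
model = AutoModelForCausalLM.from_pretrained("some/model-id", quantization_config=bnb_config)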

Types of SoTA quantization methods

Ideas

  • Start with simple quantization for storage and see the impact on inference latency and quality. (Since the model is small, there are no outliers.)

  • Use bitsandbytes (bnb) to check int8 and int4 quantization.

  • INT8 quantization, since RTX 20-series and newer GPUs support it, though it may require custom CUDA kernels.

  • Check the activation distributions (see the hook sketch after this list).

  • Check activation-aware quantization. (Maybe too much work.)
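
A possible way to inspect the activation distributions with forward hooks (a sketch, not tied to any particular model; it assumes the layers of interest are nn.Linear):

import torch

@torch.no_grad()
def collect_activation_stats(model, sample_inputs):
    # Hook every Linear layer and record the running absmax and per-batch
    # 99.9th percentile of its input, to spot outliers before quantizing.
    stats, handles = {}, []

    def make_hook(name):
        def hook(_, inputs, __):
            x = inputs[0].detach().float().flatten()
            entry = stats.setdefault(name, {"absmax": 0.0, "p99.9": 0.0})
            entry["absmax"] = max(entry["absmax"], x.abs().max().item())
            entry["p99.9"] = max(entry["p99.9"], torch.quantile(x.abs(), 0.999).item())
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            handles.append(module.register_forward_hook(make_hook(name)))
    for x in sample_inputs:
        model(x)
    for h in handles:
        h.remove()
    return stats

model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8))
print(collect_activation_stats(model, [torch.randn(4, 32) for _ in range(3)]))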

General Points

  • GPUs support int8 matrix multiplication, which is used in llm.int8() by Tim Dettmers.
  • Only the H100 has native fp8 support, which is fast but of limited use since it is restricted to the H100.
  • Using torch autocast/amp is a simple and effective way to convert between dtypes; better techniques may exist, but this works well (see the snippet after this list).
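
A minimal autocast sketch, assuming a CUDA device is available (the model is a placeholder):

import torch

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the real denoiser
x = torch.randn(8, 1024, device="cuda")

# Weights stay in fp32; autocast runs the eligible ops in bf16.
with torch.inference_mode(), torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)

print(y.dtype)   # torch.bfloat16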

Other general optimizations

Getting rid of GPU syncs after compilation

During the iterative reverse diffusion process, we call step() on the scheduler each time after the denoiser predicts the less noisy latent embeddings. Inside step(), the sigmas variable is indexed. If the sigmas array is placed on the GPU, indexing causes a communication sync between the CPU and GPU. This adds latency, and it becomes more evident when the denoiser has already been compiled.

But if the sigmas array always stays on the CPU (refer to this line), this sync doesn't take place and latency improves. In general, any CPU <-> GPU communication sync should be avoided or kept to a bare minimum, as it can impact inference latency.
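
A minimal illustration of the difference (the array and index names are illustrative, and a CUDA device is assumed):

import torch

sigmas_gpu = torch.linspace(1.0, 0.0, 50, device="cuda")
sigmas_cpu = sigmas_gpu.cpu()

i = 10
sigma = sigmas_gpu[i].item()   # reading the value forces a CPU <-> GPU sync
sigma = sigmas_cpu[i].item()   # plain CPU read, no sync, no stall after compilation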

Quantize Diffusion

Read

Libraries

CUDA references

Good discussions

Misc

FP8 vs INT8

A Qualcomm whitepaper shows that the hardware implementation of the FP8 format is somewhere between 50% and 180% less efficient than INT8 in terms of chip area and energy usage. This is because of the additional logic needed in the accumulation of FP formats versus integer formats. This seems like a broad range, but the actual efficiency depends on many hardware design choices that vary greatly. A similar conclusion was reached recently by Microsoft and Meta: floating-point arithmetic is just much less efficient than integer arithmetic.

This means that FP8 will have to be significantly more accurate than INT8 to be worthwhile from a hardware-efficiency perspective.

Quantizing bias

Biases are not converted because, to preserve the accuracy of a typical addmm operation, they must be converted with a scale equal to the product of the input and weight scales. This leads to a ridiculously small scale and consequently requires a very high bitwidth to avoid clipping.
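
A quick numeric illustration of why that scale becomes tiny (the values are made up but typical in magnitude):

# Illustrative per-tensor scales from activation and weight quantization.
input_scale = 0.02
weight_scale = 0.001

# For an integer addmm, the bias must share the accumulator's scale:
bias_scale = input_scale * weight_scale   # 2e-05
# A bias of ~1.0 would then need an integer of ~50000, which already exceeds
# int16, so biases are usually kept in higher precision instead.
print(bias_scale, round(1.0 / bias_scale))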

Quantization layer reference

https://pytorch.org/docs/stable/amp.html#torch.autocast

CUDA Ops that can autocast to float16

__matmul__, addbmm, addmm, addmv, addr, baddbmm, bmm, chain_matmul, multi_dot, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, GRUCell, linear, LSTMCell, matmul, mm, mv, prelu, RNNCell

CUDA Ops that can autocast to float32

__pow__, __rdiv__, __rpow__, __rtruediv__, acos, asin, binary_cross_entropy_with_logits, cosh, cosine_embedding_loss, cdist, cosine_similarity, cross_entropy, cumprod, cumsum, dist, erfinv, exp, expm1, group_norm, hinge_embedding_loss, kl_div, l1_loss, layer_norm, log, log_softmax, log10, log1p, log2, margin_ranking_loss, mse_loss, multilabel_margin_loss, multi_margin_loss, nll_loss, norm, normalize, pdist, poisson_nll_loss, pow, prod, reciprocal, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softmax, softmin, softplus, sum, renorm, tan, triplet_margin_loss

CUDA Ops that promote to the widest input type

These ops don’t require a particular dtype for stability, but take multiple inputs and require that the inputs’ dtypes match. If all of the inputs are float16, the op runs in float16. If any of the inputs is float32, autocast casts all inputs to float32 and runs the op in float32.

addcdiv, addcmul, atan2, bilinear, cross, dot, grid_sample, index_put, scatter_add, tensordot

Some ops not listed here (e.g., binary ops like add) natively promote inputs without autocasting’s intervention. If inputs are a mixture of float16 and float32, these ops run in float32 and produce float32 output, regardless of whether autocast is enabled.
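
A small check of those rules, assuming a CUDA device (the expected dtypes follow the torch.autocast documentation):

import torch

a = torch.randn(16, 16, device="cuda")                        # float32
b = torch.randn(16, 16, device="cuda", dtype=torch.float16)   # float16

with torch.autocast(device_type="cuda"):
    print(torch.mm(a, a).dtype)            # float16: mm is on the float16 list
    print(torch.softmax(b, dim=-1).dtype)  # float32: softmax is on the float32 list
    print((a + b).dtype)                   # float32: add just promotes to the widest input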

Profile CUDA kernels

Add the following to profile CUDA kernels in PyTorch

import torch

torch.cuda.cudart().cudaProfilerStart()   # start capturing (pairs with --capture-range=cudaProfilerApi)
torch.cuda.nvtx.range_push("inference")   # label the region of interest with an NVTX range
# PyTorch code using CUDA kernels (for example, model inference)
torch.cuda.nvtx.range_pop()
torch.cuda.cudart().cudaProfilerStop()

Run the following command to profile the CUDA kernels

nsys profile -w true -t cuda,nvtx,osrt,cudnn,cublas -s cpu  --capture-range=cudaProfilerApi  --cudabacktrace=true -x true -o bf16_true_disabled_doubleq_dis2 tests/test_linear_modules.py 

Sources: