Commit 51d1447

Final toc format
1 parent fefdc98 commit 51d1447

File tree: 1 file changed (+10 -12 lines)

page.md

Lines changed: 10 additions & 12 deletions
@@ -37,19 +37,17 @@ license_url: "https://opensource.org/license/mit"
 
 
 Quick Jump:
-
-
 - [Why?](#why)
 - [How?](#how)
 - [Note on executing fp8 models](#note-on-executing-fp8-models)
 - [fp8 bit format](#fp8-bit-format)
 - [Quantization - scaling to lower precision loss \& handle large values](#quantization---scaling-to-lower-precision-loss--handle-large-values)
 - [Finer grained scale - weight block size](#finer-grained-scale---weight-block-size)
 - [Saving a quantized checkpoint](#saving-a-quantized-checkpoint)
-- [Add the scales to `Linear` layers](#add-the-scales-to-linear-layers)
+- [Add the scales to Linear layers](#add-the-scales-to-linear-layers)
 - [Update model config](#update-model-config)
 
-# Why?
+## Why?
 
 tl;dr:
 
@@ -67,15 +65,15 @@ Starting with NVIDIA H100 GPU, GPUs have *hardware support* for 8 bit floating p
 3. Depending on the GPU, fp8 FLOPS are just higher than `bf16` FLOPS. E.g. See [H100 specifications](https://www.nvidia.com/en-us/data-center/h100/); bfloat16 has ~2k teraFLOPS and fp8 has ~4k teraFLOPS
 
 
-# How?
+## How?
 
-## Note on executing fp8 models
+### Note on executing fp8 models
 
 When we talk about `fp8` models, we typically only are talking about the **weights being `fp8`**. The actual execution of the model is still done in `bf16`. So all the **intermediate tensors are still in `bf16`**, and it's the underlying CUDA kernels that are taking in `bf16` tensors and `fp8` weights.
 
 **fp8 models still use `bf16` kv cache by default** (since the kv cache stores kv values, which are intermediate tensors).
 
-## fp8 bit format
+### fp8 bit format
 
 There are a number of different `fp8` formats; the most common is `float8_e4m3fn`. Here are some facts about it:
 
@@ -108,7 +106,7 @@ So this leads us with two questions for quantization:
 1. `bf16` can store values between `[-3.38953e+38, +3.38953e+38]`, how do we fit that into `fp8` range of `[-448, +448]`?
 2. How do we take advantage of the distribution of values in `fp8`?
 
-## Quantization - scaling to lower precision loss & handle large values
+### Quantization - scaling to lower precision loss & handle large values
 
 Since `bf16` and `fp8` have different ranges, we need to scale the values to fit into the `fp8` range. This scale is based
 on the max value of the data at `bf16`, and is roughly computed like:
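The hunk cuts off right before the article's own snippet, but the `/ 448` divisor and the dequantize line visible in the hunks below suggest its rough shape. A minimal per-tensor sketch (the function name and the clamp are illustrative, not the committed code):

```python
import torch

def quantize_per_tensor(x: torch.Tensor):
    # Scale derived from the largest magnitude in the bf16 tensor, so the
    # quantized values land inside fp8's [-448, +448] range.
    scale = x.abs().amax() / 448
    x_quantized = (x / scale).clamp(-448, 448).to(torch.float8_e4m3fn)
    return x_quantized, scale

# Dequantize (done on the fly at runtime inside the CUDA kernels):
# x_dequantized = x_quantized.to(torch.bfloat16) * scale
```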
@@ -128,7 +126,7 @@ And to dequantize (which is essentially done on the fly at runtime inside the CU
 x_dequantized = x.to(torch.bfloat16) * scale
 ```
 
-## Finer grained scale - weight block size
+### Finer grained scale - weight block size
 
 Above I showed the scale being a single value, but you can also have it be a tensor. If you look at some popular open source `fp8` models they typically use this option.
 
@@ -146,11 +144,11 @@ scale = x.abs().amax(dim=[1, 3]) / 448
 assert scale.shape == torch.Size([N // n, K // k])
 ```
 
-# Saving a quantized checkpoint
+## Saving a quantized checkpoint
 
 For compatibility with things like VLLM there's a couple things we need to do:
 
-## Add the scales to `Linear` layers
+### Add the scales to Linear layers
 
 We need to add the previously computed `weight_scale` as a parameter to each of the `Linear` layers. This basically means just replace the `Linear` layer with this custom `PackedLinear` class, where `weight` is the `fp8` tensor, and `weight_scale` is the scale from previous sections.
 
@@ -162,7 +160,7 @@ class PackedLinear(torch.nn.Module):
         self.weight_scale = torch.nn.Parameter(weight_scale, requires_grad=False)
 ```
 
-## Update model config
+### Update model config
 
 This part is really easy, just add a `quantization_config` into the model's config. This will also appear in the `config.json` file in the huggingface repo of the model.
 