New file: doc/tutorial/torchao/index.md
liuxinwei committed Oct 8, 2024
1 parent 4cfb4b0 commit a0d654b
Showing 3 changed files with 199 additions and 0 deletions.
1 change: 1 addition & 0 deletions doc/tutorial/index.md
@@ -14,5 +14,6 @@ Detectron2/index
torchmetrics/index
ddp/index
executorch/index
torchao/index
chaos/index
```
5 changes: 5 additions & 0 deletions doc/tutorial/torchao/index.md
@@ -0,0 +1,5 @@
# PyTorch Native Architecture Optimization: `torchao`

```{toctree}
intro
```
193 changes: 193 additions & 0 deletions doc/tutorial/torchao/intro.ipynb
@@ -0,0 +1,193 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# `torchao` 概述\n",
"\n",
"原文:[pytorch-native-architecture-optimization](https://pytorch.org/blog/pytorch-native-architecture-optimization/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[torchao](https://github.com/pytorch/ao) 是 PyTorch 原生库,通过利用低位宽数据类型、量化和稀疏性,使模型更快更小。`torchao` 是一个易于访问的工具包,包含(主要是)用易于阅读的 PyTorch 代码编写的技术,涵盖推理和训练两个方面。\n",
"\n",
"除非另有说明,基线是在 A100 80GB GPU 上运行的 bf16。\n",
"\n",
"针对 LLama 3 的主要指标包括:\n",
"\n",
"- 使用 `autoquant` 和仅 `int4` 权重量化加 `hqq`,使 LLama 3 8B 推理速度提升 $97\\%$。\n",
"- 在 128K 上下文长度下,使用量化 KV 缓存,使 LLama 3.1 8B 推理的峰值 VRAM 减少 $73\\%$。\n",
"- 使用 `float8` 训练在 H100 上进行 LLama 3 70B 预训练,速度提升 $50\\%$。\n",
"- 使用 4 比特量化优化器,使 LLama 3 8B 的峰值 VRAM 减少 $30\\%$。\n",
"\n",
"针对扩散模型推理的主要指标包括:\n",
"\n",
"- 在 `flux1.dev` 上使用 float8 动态量化推理和 float8 逐行缩放,在 H100 上速度提升 $53\\%$。\n",
"- 对于 `CogVideoX`,使用 `int8` 动态量化使模型 VRAM 减少 $50\\%$。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 推理量化算法\n",
"\n",
"[推理量化算法](https://github.com/pytorch/ao/tree/main/torchao/quantization)适用于包含 `nn.Linear` 层的任意 PyTorch 模型。通过我们的顶层 API `quantize_`,可以选择仅权重和动态激活量化,支持多种数据类型和稀疏布局。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from torchao.quantization import ( \n",
" quantize_, \n",
" int4_weight_only, \n",
") \n",
"quantize_(model, int4_weight_only())\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"有时,由于开销问题,量化一个层可能会使其变慢。因此,如果你希望我们为你选择如何量化模型中的每一层,那么你可以选择运行\n",
"```python\n",
"model = torchao.autoquant(torch.compile(model, mode='max-autotune'))\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`quantize_` API 根据模型是计算密集型还是内存密集型提供了一些不同的选项。\n",
"```python\n",
"from torchao.quantization import ( \n",
" # Memory bound models \n",
" int4_weight_only, \n",
" int8_weight_only,\n",
"\n",
" # Compute bound models \n",
" int8_dynamic_activation_int8_semi_sparse_weight, \n",
" int8_dynamic_activation_int8_weight, \n",
" \n",
" # Device capability 8.9+ \n",
" float8_weight_only, \n",
" float8_dynamic_activation_float8_weight, \n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"API是可组合的,例如我们结合了稀疏性和量化,为 ViT-H 推理带来了 $5\\%$ 的速度提升。\n",
"\n",
"但我们也可以做一些事情,比如将权重量化为 `int4`,并将 kv 缓存量化为 `int8`,以支持在不到 18.9GB VRAM 下全长度 128K 上下文运行的 Llama 3.1 8B。"
]
},
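{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this composability (an illustration rather than an official recipe; actual layer coverage and speedups depend on your model and hardware), the same `quantize_` entry point accepts the sparse-plus-quantized option from the list above:\n",
"```python\n",
"from torchao.quantization import (\n",
"    quantize_,\n",
"    int8_dynamic_activation_int8_semi_sparse_weight,\n",
")\n",
"\n",
"# int8 dynamic activation quantization combined with an int8 2:4 semi-sparse weight layout\n",
"quantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())\n",
"```"
]
},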
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QAT(量化感知训练)\n",
"\n",
"在 4 比特以下的后训练量化中,准确性可能会严重下降。通过使用[量化感知训练](https://pytorch.org/blog/quantization-aware-training/)(Quantization Aware Training, QAT),我们已经成功恢复了高达 $96\\%$ 的准确性损失。我们将这一方法作为端到端方案集成到了 `torchtune` 中,并附带了一个[简单的教程](https://github.com/pytorch/ao/tree/main/torchao/quantization/prototype/qat)。"
]
},
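{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the QAT flow from the linked tutorial (the quantizer lives in a prototype namespace, so the exact import path and class name may shift between releases):\n",
"```python\n",
"from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer\n",
"\n",
"qat_quantizer = Int8DynActInt4WeightQATQuantizer()\n",
"\n",
"# Insert \"fake quantize\" ops so training sees quantization numerics\n",
"model = qat_quantizer.prepare(model)\n",
"\n",
"# ... run your usual fine-tuning loop on `model` ...\n",
"\n",
"# Swap the fake-quantized ops for actually quantized ones\n",
"model = qat_quantizer.convert(model)\n",
"```"
]
},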
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 低精度计算和通信\n",
"\n",
"`torchao`提供易于使用的端到端工作流,用于降低训练计算和分布式通信的精度,从 `torch.nn.Linear` 层的 `float8` 开始。以下是将训练运行的计算 `gemm` 转换为 `float8` 的一行代码:\n",
"```python\n",
"from torchao.float8 import convert_to_float8_training \n",
"convert_to_float8_training(model)\n",
"```\n",
"\n",
"有关如何通过使用 `float8` 将 LLaMa 3 70B 预训练速度提高多达 1.5 倍的端到端示例,请参阅我们的 [README](https://github.com/pytorch/ao/tree/main/torchao/float8)、[torchtitan 的博客](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359)和 [`float8` 配方](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们正在扩展我们的训练工作流以支持更多的数据类型和布局。\n",
"\n",
"- [在 `torchtune` 中进行 NF4 QLoRA](https://pytorch.org/torchtune/main/tutorials/qlora_finetune.html)\n",
"- [原型 `int8` 训练支持](https://github.com/pytorch/ao/pull/748)\n",
"- [加速的稀疏 `2:4` 训练](https://pytorch.org/blog/accelerating-neural-network-training/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 低比特优化器\n",
"\n",
"受到 Bits and Bytes 的启发,我们还添加了 8 比特和 4 比特优化器的原型支持,作为 `AdamW` 的即插即用替代品。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit \n",
"optim = AdamW8bit(model.parameters())\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 集成\n",
"\n",
"我们一直在积极努力,确保 `torchao` 在开源中一些最重要的项目中能够良好工作。\n",
"\n",
"- 作为[推理后端的 Huggingface transformers](https://huggingface.co/docs/transformers/main/quantization/torchao)\n",
"- 在 [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao) 中作为加速扩散模型的参考实现\n",
"- 在 HQQ 中用于[快速 4 比特推理](https://github.com/mobiusml/hqq#faster-inference)\n",
"- 在 [`torchtune`](https://github.com/pytorch/torchtune) 中用于 PyTorch 原生 QLoRA 和 QAT 配方\n",
"- 在 [`torchchat`](https://github.com/pytorch/torchchat) 中用于后训练量化\n",
"- 在 SGLang 中用于 [`int4` 和 `int8` 后训练量化](https://github.com/sgl-project/sglang/pull/1341)"
]
},
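{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a rough sketch of the transformers integration (this assumes a recent transformers release that ships `TorchAoConfig`; the checkpoint name is only an illustration):\n",
"```python\n",
"from transformers import AutoModelForCausalLM, TorchAoConfig\n",
"\n",
"# Quantize weights to int4 as the model is loaded\n",
"quantization_config = TorchAoConfig(\"int4_weight_only\", group_size=128)\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
"    \"meta-llama/Meta-Llama-3-8B\",  # illustrative checkpoint\n",
"    torch_dtype=\"auto\",\n",
"    device_map=\"auto\",\n",
"    quantization_config=quantization_config,\n",
")\n",
"```"
]
},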
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
