generated from xinetzone/xyzstyle
Commit a0d654b (1 parent: 4cfb4b0), committed by liuxinwei on Oct 8, 2024
Showing 3 changed files with 199 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -14,5 +14,6 @@ Detectron2/index | |
torchmetrics/index | ||
ddp/index | ||
executorch/index | ||
torchao/index | ||
chaos/index | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# PyTorch Native Architecture Optimization: `torchao` | ||
|
||
```{toctree} | ||
intro | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,193 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# `torchao` Overview\n", | ||
"\n", | ||
"Original post: [pytorch-native-architecture-optimization](https://pytorch.org/blog/pytorch-native-architecture-optimization/)" | ||
] | ||
}, | ||
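{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"[torchao](https://github.com/pytorch/ao) is a PyTorch-native library that makes models faster and smaller by leveraging low-bit dtypes, quantization, and sparsity. `torchao` is an accessible toolkit of techniques, written (mostly) in easy-to-read PyTorch code, that covers both inference and training.\n", | ||
"\n", | ||
"Unless otherwise stated, the baseline is bf16 running on an A100 80GB GPU.\n", | ||
"\n", | ||
"The headline metrics for Llama 3 are:\n", | ||
"\n", | ||
"- $97\\%$ speedup for Llama 3 8B inference using `autoquant` with `int4` weight-only quantization plus `hqq`.\n", | ||
"- $73\\%$ reduction in peak VRAM for Llama 3.1 8B inference at 128K context length using a quantized KV cache.\n", | ||
"- $50\\%$ speedup for Llama 3 70B pretraining on H100 using `float8` training.\n", | ||
"- $30\\%$ reduction in peak VRAM for Llama 3 8B using 4-bit quantized optimizers.\n", | ||
"\n", | ||
"The headline metrics for diffusion model inference are:\n", | ||
"\n", | ||
"- $53\\%$ speedup on H100 for `flux1.dev` using float8 dynamic quantization inference with float8 row-wise scaling.\n", | ||
"- $50\\%$ reduction in model VRAM for `CogVideoX` using `int8` dynamic quantization." | ||
] | ||
}, | ||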
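{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Inference quantization algorithms\n", | ||
"\n", | ||
"The [inference quantization algorithms](https://github.com/pytorch/ao/tree/main/torchao/quantization) work on arbitrary PyTorch models that contain `nn.Linear` layers. Through the top-level `quantize_` API, you can select weight-only or dynamic activation quantization for a variety of dtypes and sparse layouts." | ||
] | ||
}, | ||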
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"```python\n", | ||
"from torchao.quantization import ( \n", | ||
" quantize_, \n", | ||
" int4_weight_only, \n", | ||
") \n", | ||
"quantize_(model, int4_weight_only())\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Quantizing a layer can sometimes make it slower because of overhead. If you would rather have `torchao` choose how to quantize each layer of the model for you, you can instead run:\n", | ||
"```python\n", | ||
"model = torchao.autoquant(torch.compile(model, mode='max-autotune'))\n", | ||
"```\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The `quantize_` API offers several options, depending on whether the model is compute bound or memory bound.\n", | ||
"```python\n", | ||
"from torchao.quantization import ( \n", | ||
" # Memory bound models \n", | ||
" int4_weight_only, \n", | ||
" int8_weight_only,\n", | ||
"\n", | ||
" # Compute bound models \n", | ||
" int8_dynamic_activation_int8_semi_sparse_weight, \n", | ||
" int8_dynamic_activation_int8_weight, \n", | ||
" \n", | ||
" # Device capability 8.9+ \n", | ||
" float8_weight_only, \n", | ||
" float8_dynamic_activation_float8_weight, \n", | ||
")\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The APIs are composable. For example, combining sparsity and quantization yields a $5\\%$ speedup for ViT-H inference.\n", | ||
"\n", | ||
"You can also quantize the weights to `int4` and the KV cache to `int8`, which lets Llama 3.1 8B run at the full 128K context length in under 18.9GB of VRAM." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## QAT (Quantization-Aware Training)\n", | ||
"\n", | ||
"Post-training quantization below 4 bits can degrade accuracy severely. With [Quantization-Aware Training](https://pytorch.org/blog/quantization-aware-training/) (QAT), we have recovered up to $96\\%$ of the accuracy degradation. We have integrated this approach as an end-to-end recipe in `torchtune`, along with a [minimal tutorial](https://github.com/pytorch/ao/tree/main/torchao/quantization/prototype/qat)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Low-precision compute and communication\n", | ||
"\n", | ||
"`torchao` provides easy-to-use end-to-end workflows for reducing the precision of training compute and distributed communication, starting with `float8` for `torch.nn.Linear` layers. The following one-liner converts the compute `gemm`s of a training run to `float8`:\n", | ||
"```python\n", | ||
"from torchao.float8 import convert_to_float8_training \n", | ||
"convert_to_float8_training(model)\n", | ||
"```\n", | ||
"\n", | ||
"For an end-to-end example of speeding up LLaMa 3 70B pretraining by up to 1.5x with `float8`, see our [README](https://github.com/pytorch/ao/tree/main/torchao/float8), the [torchtitan blog post](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359), and the [`float8` recipe](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We are expanding our training workflows to support more dtypes and layouts.\n", | ||
"\n", | ||
"- [NF4 QLoRA in `torchtune`](https://pytorch.org/torchtune/main/tutorials/qlora_finetune.html)\n", | ||
"- [Prototype `int8` training support](https://github.com/pytorch/ao/pull/748)\n", | ||
"- [Accelerated sparse `2:4` training](https://pytorch.org/blog/accelerating-neural-network-training/)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Low-bit optimizers\n", | ||
"\n", | ||
"Inspired by Bits and Bytes, we have also added prototype support for 8-bit and 4-bit optimizers as drop-in replacements for `AdamW`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"```python\n", | ||
"from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit \n", | ||
"optim = AdamW8bit(model.parameters())\n", | ||
"```" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Integrations\n", | ||
"\n", | ||
"We have been actively working to make sure `torchao` works well with some of the most important open-source projects.\n", | ||
"\n", | ||
"- [Hugging Face transformers, as an inference backend](https://huggingface.co/docs/transformers/main/quantization/torchao)\n", | ||
"- In [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao), as a reference implementation for accelerating diffusion models\n", | ||
"- In HQQ, for [fast 4-bit inference](https://github.com/mobiusml/hqq#faster-inference)\n", | ||
"- In [`torchtune`](https://github.com/pytorch/torchtune), for PyTorch-native QLoRA and QAT recipes\n", | ||
"- In [`torchchat`](https://github.com/pytorch/torchchat), for post-training quantization\n", | ||
"- In SGLang, for [`int4` and `int8` post-training quantization](https://github.com/sgl-project/sglang/pull/1341)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.12.2" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
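
The notebook refers to `int4` weight-only quantization without showing what it does to a weight tensor. The sketch below is a pure-Python illustration of the general idea: weights are split into groups, and each group is stored as 4-bit integers plus one floating-point scale. It is not `torchao`'s implementation, which operates on packed tensors with fused kernels. The function names `quantize_int4_groupwise` and `dequantize_int4_groupwise`, the symmetric scheme, and the group size of 4 are all assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_int4_groupwise(
    weights: List[float], group_size: int = 4
) -> Tuple[List[int], List[float]]:
    """Symmetric 4-bit quantization (illustrative): one float scale per group."""
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a positive multiple of group_size")
    q_values: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7, the largest positive int4 value.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            # Python's round() rounds exact halves to the nearest even integer.
            q = round(w / scale)
            # Clamp to the signed 4-bit range [-8, 7].
            q_values.append(max(-8, min(7, q)))
    return q_values, scales


def dequantize_int4_groupwise(
    q_values: List[int], scales: List[float], group_size: int = 4
) -> List[float]:
    """Recover approximate weights by multiplying each integer by its group scale."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


weights = [0.7, -0.35, 0.1, 0.0, 14.0, -7.0, 2.0, 1.0]
q_values, scales = quantize_int4_groupwise(weights, group_size=4)
restored = dequantize_int4_groupwise(q_values, scales, group_size=4)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q_values, scales, max_error)
```

In this symmetric scheme the reconstruction error of each weight is at most half of its group's scale, so smaller groups track outliers more closely at the cost of storing more scales. That trade-off is one reason a weight-only scheme mainly helps memory-bound models: the weights shrink, while the matrix multiply still needs them dequantized.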