New file: doc/tutorial/torchao/index.md
liuxinwei committed Oct 8, 2024
1 parent 4cfb4b0 commit a0d654b
Showing 3 changed files with 199 additions and 0 deletions.
1 change: 1 addition & 0 deletions doc/tutorial/index.md
@@ -14,5 +14,6 @@ Detectron2/index
torchmetrics/index
ddp/index
executorch/index
torchao/index
chaos/index
```
5 changes: 5 additions & 0 deletions doc/tutorial/torchao/index.md
@@ -0,0 +1,5 @@
# PyTorch Native Architecture Optimization: `torchao`

```{toctree}
intro
```
193 changes: 193 additions & 0 deletions doc/tutorial/torchao/intro.ipynb
@@ -0,0 +1,193 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# `torchao` 概述\n",
"\n",
"原文:[pytorch-native-architecture-optimization](https://pytorch.org/blog/pytorch-native-architecture-optimization/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[torchao](https://github.com/pytorch/ao) 是 PyTorch 原生库,通过利用低位宽数据类型、量化和稀疏性,使模型更快更小。`torchao` 是一个易于访问的工具包,包含(主要是)用易于阅读的 PyTorch 代码编写的技术,涵盖推理和训练两个方面。\n",
"\n",
"除非另有说明,基线是在 A100 80GB GPU 上运行的 bf16。\n",
"\n",
"针对 LLama 3 的主要指标包括:\n",
"\n",
"- 使用 `autoquant` 和仅 `int4` 权重量化加 `hqq`,使 LLama 3 8B 推理速度提升 $97\\%$。\n",
"- 在 128K 上下文长度下,使用量化 KV 缓存,使 LLama 3.1 8B 推理的峰值 VRAM 减少 $73\\%$。\n",
"- 使用 `float8` 训练在 H100 上进行 LLama 3 70B 预训练,速度提升 $50\\%$。\n",
"- 使用 4 比特量化优化器,使 LLama 3 8B 的峰值 VRAM 减少 $30\\%$。\n",
"\n",
"针对扩散模型推理的主要指标包括:\n",
"\n",
"- 在 `flux1.dev` 上使用 float8 动态量化推理和 float8 逐行缩放,在 H100 上速度提升 $53\\%$。\n",
"- 对于 `CogVideoX`,使用 `int8` 动态量化使模型 VRAM 减少 $50\\%$。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 推理量化算法\n",
"\n",
"[推理量化算法](https://github.com/pytorch/ao/tree/main/torchao/quantization)适用于包含 `nn.Linear` 层的任意 PyTorch 模型。通过我们的顶层 API `quantize_`,可以选择仅权重和动态激活量化,支持多种数据类型和稀疏布局。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from torchao.quantization import ( \n",
" quantize_, \n",
" int4_weight_only, \n",
") \n",
"quantize_(model, int4_weight_only())\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"有时,由于开销问题,量化一个层可能会使其变慢。因此,如果你希望我们为你选择如何量化模型中的每一层,那么你可以选择运行\n",
"```python\n",
"model = torchao.autoquant(torch.compile(model, mode='max-autotune'))\n",
"```\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`quantize_` API 根据模型是计算密集型还是内存密集型提供了一些不同的选项。\n",
"```python\n",
"from torchao.quantization import ( \n",
" # Memory bound models \n",
" int4_weight_only, \n",
" int8_weight_only,\n",
"\n",
" # Compute bound models \n",
" int8_dynamic_activation_int8_semi_sparse_weight, \n",
" int8_dynamic_activation_int8_weight, \n",
" \n",
" # Device capability 8.9+ \n",
" float8_weight_only, \n",
" float8_dynamic_activation_float8_weight, \n",
")\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"API是可组合的,例如我们结合了稀疏性和量化,为 ViT-H 推理带来了 $5\\%$ 的速度提升。\n",
"\n",
"但我们也可以做一些事情,比如将权重量化为 `int4`,并将 kv 缓存量化为 `int8`,以支持在不到 18.9GB VRAM 下全长度 128K 上下文运行的 Llama 3.1 8B。"
]
},
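{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of this composability (an illustration rather than an official recipe; actual layer coverage and speedups depend on your model and hardware), the same `quantize_` entry point accepts the sparse-plus-quantized option from the list above:\n",
"```python\n",
"from torchao.quantization import (\n",
"    quantize_,\n",
"    int8_dynamic_activation_int8_semi_sparse_weight,\n",
")\n",
"\n",
"# int8 dynamic activation quantization combined with an int8 2:4 semi-sparse weight layout\n",
"quantize_(model, int8_dynamic_activation_int8_semi_sparse_weight())\n",
"```"
]
},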
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## QAT(量化感知训练)\n",
"\n",
"在 4 比特以下的后训练量化中,准确性可能会严重下降。通过使用[量化感知训练](https://pytorch.org/blog/quantization-aware-training/)(Quantization Aware Training, QAT),我们已经成功恢复了高达 $96\\%$ 的准确性损失。我们将这一方法作为端到端方案集成到了 `torchtune` 中,并附带了一个[简单的教程](https://github.com/pytorch/ao/tree/main/torchao/quantization/prototype/qat)。"
]
},
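{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the QAT flow from the linked tutorial (the quantizer lives in a prototype namespace, so the exact import path and class name may shift between releases):\n",
"```python\n",
"from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer\n",
"\n",
"qat_quantizer = Int8DynActInt4WeightQATQuantizer()\n",
"\n",
"# Insert \"fake quantize\" ops so training sees quantization numerics\n",
"model = qat_quantizer.prepare(model)\n",
"\n",
"# ... run your usual fine-tuning loop on `model` ...\n",
"\n",
"# Swap the fake-quantized ops for actually quantized ones\n",
"model = qat_quantizer.convert(model)\n",
"```"
]
},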
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 低精度计算和通信\n",
"\n",
"`torchao`提供易于使用的端到端工作流,用于降低训练计算和分布式通信的精度,从 `torch.nn.Linear` 层的 `float8` 开始。以下是将训练运行的计算 `gemm` 转换为 `float8` 的一行代码:\n",
"```python\n",
"from torchao.float8 import convert_to_float8_training \n",
"convert_to_float8_training(model)\n",
"```\n",
"\n",
"有关如何通过使用 `float8` 将 LLaMa 3 70B 预训练速度提高多达 1.5 倍的端到端示例,请参阅我们的 [README](https://github.com/pytorch/ao/tree/main/torchao/float8)、[torchtitan 的博客](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359)和 [`float8` 配方](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"我们正在扩展我们的训练工作流以支持更多的数据类型和布局。\n",
"\n",
"- [在 `torchtune` 中进行 NF4 QLoRA](https://pytorch.org/torchtune/main/tutorials/qlora_finetune.html)\n",
"- [原型 `int8` 训练支持](https://github.com/pytorch/ao/pull/748)\n",
"- [加速的稀疏 `2:4` 训练](https://pytorch.org/blog/accelerating-neural-network-training/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 低比特优化器\n",
"\n",
"受到 Bits and Bytes 的启发,我们还添加了 8 比特和 4 比特优化器的原型支持,作为 `AdamW` 的即插即用替代品。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```python\n",
"from torchao.prototype.low_bit_optim import AdamW8bit, AdamW4bit \n",
"optim = AdamW8bit(model.parameters())\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 集成\n",
"\n",
"我们一直在积极努力,确保 `torchao` 在开源中一些最重要的项目中能够良好工作。\n",
"\n",
"- 作为[推理后端的 Huggingface transformers](https://huggingface.co/docs/transformers/main/quantization/torchao)\n",
"- 在 [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao) 中作为加速扩散模型的参考实现\n",
"- 在 HQQ 中用于[快速 4 比特推理](https://github.com/mobiusml/hqq#faster-inference)\n",
"- 在 [`torchtune`](https://github.com/pytorch/torchtune) 中用于 PyTorch 原生 QLoRA 和 QAT 配方\n",
"- 在 [`torchchat`](https://github.com/pytorch/torchchat) 中用于后训练量化\n",
"- 在 SGLang 中用于 [`int4` 和 `int8` 后训练量化](https://github.com/sgl-project/sglang/pull/1341)"
]
},
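{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a rough sketch of the transformers integration (this assumes a recent transformers release that ships `TorchAoConfig`; the checkpoint name is only an illustration):\n",
"```python\n",
"from transformers import AutoModelForCausalLM, TorchAoConfig\n",
"\n",
"# Quantize weights to int4 as the model is loaded\n",
"quantization_config = TorchAoConfig(\"int4_weight_only\", group_size=128)\n",
"model = AutoModelForCausalLM.from_pretrained(\n",
"    \"meta-llama/Meta-Llama-3-8B\",  # illustrative checkpoint\n",
"    torch_dtype=\"auto\",\n",
"    device_map=\"auto\",\n",
"    quantization_config=quantization_config,\n",
")\n",
"```"
]
},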
{
"cell_type": "markdown",
"metadata": {},
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
