diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml index 4466865d4..959b2eba8 100644 --- a/.gitlab-ci.yml +++ b/.gitlab-ci.yml @@ -4,6 +4,7 @@ image: python:alpine before_script: - pip install -i https://pypi.tuna.tsinghua.edu.cn/simple mkdocs && mkdocs --version - pip install -i https://pypi.tuna.tsinghua.edu.cn/simple mkdocs-material + - pip install -i https://pypi.tuna.tsinghua.edu.cn/simple mkdocs-minify-plugin # CI 拉取 submodules variables: diff --git a/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#conclusion.md b/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#conclusion.md deleted file mode 100755 index ed3ebeb1d..000000000 --- a/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#conclusion.md +++ /dev/null @@ -1,592 +0,0 @@ -# (测试版)通过缩放点积注意力 (SDPA) 实现高性能 Transformer [¶](#beta-implementing-high-performance-transformers-with-scaled-dot-product-attention- sdpa"此标题的永久链接") - - -> 译者:[片刻小哥哥](https://github.com/jiangzhonglian) -> -> 项目地址: -> -> 原始地址: - - - - -**作者:** -[Driss Guessous](https://github.com/drisspg) - - - - - -## 摘要 [¶](#summary "此标题的永久链接") - - - - - 在本教程中,我们想要重点介绍一个新的 - `torch.nn.function` - 函数,它有助于实现 Transformer 架构。该函数名为 - `torch.nn.function.scaled_dot_product_attention` - 。 -有关该函数的详细说明,请参阅 - [PyTorch 文档](https://pytorch.org/docs/master/generated/torch.nn.function.scaled_dot_product_attention.html#torch.nn.function.scaled_dot_product_attention) -. -此函数已合并到 - `torch.nn.MultiheadAttention` - 和 - `torch.nn.TransformerEncoderLayer` -. - - - - - -## 概述 [¶](#overview "此标题的永久链接") - - - - - 在较高层面上,此 PyTorch 函数根据 -论文中的定义计算查询、键和值之间的 -缩放点积注意力 (SDPA) - [注意力就是您所需要的](https://arxiv.org/abs/1706.03762) - 。虽然可以使用现有函数在 PyTorch 中编写此函数,但融合实现可以比原始实现提供更大的性能优势。 - - - - - -## 融合实现 [¶](#fused-implementations "永久链接到此标题") - - - - - 对于 CUDA tensor输入,该函数将分派到以下实现之一 -: - - - -* [FlashAttention:具有 IO 感知的快速、内存高效的精确注意力](https://arxiv.org/abs/2205.14135) -* [内存高效的注意力](https://github.com/facebookresearch/xformers ) -* 用 C++ 定义的 PyTorch 实现 - - - - - 注意 - - - - - 本教程需要 PyTorch 2.0.0 或更高版本。 - - - - - - - -``` -import torch -import torch.nn as nn -import torch.nn.functional as F -device = "cuda" if torch.cuda.is_available() else "cpu" - -# Example Usage: -query, key, value = torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device) -F.scaled_dot_product_attention(query, key, value) - -``` - - - - - - -``` -tensor([[[-1.3321, -0.3489, 0.3015, -0.3912, 0.9867, 0.3137, -0.0691, - -1.2593], - [-1.0882, 0.2506, 0.6491, 0.1360, 0.5238, -0.2448, -0.0820, - -0.6171], - [-1.0012, 0.3990, 0.6441, -0.0277, 0.5325, -0.2564, -0.0607, - -0.6404]], - - [[ 0.6091, 0.0708, 0.6188, 0.3252, -0.1598, 0.4197, -0.2335, - 0.0630], - [ 0.5285, 0.3890, -0.2649, 0.3706, -0.3839, 0.1963, -0.6242, - 0.2312], - [ 0.4048, 0.0762, 0.3777, 0.4689, -0.2978, 0.2754, -0.6429, - 0.1037]]], device='cuda:0') - -``` - - - - - -## 显式调度程序控制 [¶](#explicit-dispatcher-control "永久链接到此标题") - - - - - 虽然该函数将隐式分派到三个 -实现之一,但用户还可以通过使用上下文管理器 -显式控制分派。此上下文管理器允许用户 -显式禁用某些实现。如果用户想要确保 -该函数确实对其特定输入使用 -最快的实现, -可以使用上下文管理器来扫描 -测量性能。 - - - - - - -``` -# Lets define a helpful benchmarking function: -import torch.utils.benchmark as benchmark -def benchmark_torch_function_in_microseconds(f, *args, **kwargs): - t0 = benchmark.Timer( - stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f} - ) - return t0.blocked_autorange().mean * 1e6 - -# Lets define the hyper-parameters of our input -batch_size = 32 -max_sequence_len = 1024 -num_heads = 32 
-embed_dimension = 32 - -dtype = torch.float16 - -query = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) -key = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) -value = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) - -print(f"The default implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - -# Lets explore the speed of each of the 3 implementations -from torch.backends.cuda import sdp_kernel, SDPBackend - -# Helpful arguments mapper -backend_map = { - SDPBackend.MATH: {"enable_math": True, "enable_flash": False, "enable_mem_efficient": False}, - SDPBackend.FLASH_ATTENTION: {"enable_math": False, "enable_flash": True, "enable_mem_efficient": False}, - SDPBackend.EFFICIENT_ATTENTION: { - "enable_math": False, "enable_flash": False, "enable_mem_efficient": True} -} - -with sdp_kernel(**backend_map[SDPBackend.MATH]): - print(f"The math implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - - -with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]): - try: - print(f"The flash attention implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - except RuntimeError: - print("FlashAttention is not supported. See warnings for reasons.") - -with sdp_kernel(**backend_map[SDPBackend.EFFICIENT_ATTENTION]): - try: - print(f"The memory efficient implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - except RuntimeError: - print("EfficientAttention is not supported. 
See warnings for reasons.") - -``` - - - - - - -``` -The default implementation runs in 4741.745 microseconds -The math implementation runs in 19249.446 microseconds -The flash attention implementation runs in 4741.583 microseconds -The memory efficient implementation runs in 4193.383 microseconds - -``` - - - - - -## 硬件依赖性 [¶](#hardware-dependence "永久链接到此标题") - - - - - 根据您运行上述单元的机器以及可用的硬件,您的结果可能会有所不同。 -- 如果您没有’ 没有 GPU 并且在 CPU 上运行,则上下文管理器\ n 将没有任何效果,并且所有三个运行都应返回相似的计时。 -- 取决于您的显卡支持的计算能力 -闪存关注或内存效率可能会失败。 - - - - - -## 因果自注意力 [¶](#causal-self-attention "永久链接到此标题") - - - - - 下面是一个多头因果自我注意力块的示例实现,灵感来自于 - [Andrej Karpathy NanoGPT](https://github.com/karpathy/nanoGPT) - 存储库。 - - - - - - -``` -class CausalSelfAttention(nn.Module): - - def __init__(self, num_heads: int, embed_dimension: int, bias: bool=False, is_causal: bool=False, dropout:float=0.0): - super().__init__() - assert embed_dimension % num_heads == 0 - # key, query, value projections for all heads, but in a batch - self.c_attn = nn.Linear(embed_dimension, 3 * embed_dimension, bias=bias) - # output projection - self.c_proj = nn.Linear(embed_dimension, embed_dimension, bias=bias) - # regularization - self.dropout = dropout - self.resid_dropout = nn.Dropout(dropout) - self.num_heads = num_heads - self.embed_dimension = embed_dimension - # Perform causal masking - self.is_causal = is_causal - - def forward(self, x): - # calculate query, key, values for all heads in batch and move head forward to be the batch dim - query_projected = self.c_attn(x) - - batch_size = query_projected.size(0) - embed_dim = query_projected.size(2) - head_dim = embed_dim // (self.num_heads * 3) - - query, key, value = query_projected.chunk(3, -1) - query = query.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - key = key.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - value = value.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - - if self.training: - dropout = self.dropout - is_causal = self.is_causal - else: - dropout = 0.0 - is_causal = False - - y = F.scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=dropout, is_causal=is_causal) - y = y.transpose(1, 2).view(batch_size, -1, self.num_heads * head_dim) - - y = self.resid_dropout(self.c_proj(y)) - return y - - -num_heads = 8 -heads_per_dim = 64 -embed_dimension = num_heads * heads_per_dim -dtype = torch.float16 -model = CausalSelfAttention(num_heads=num_heads, embed_dimension=embed_dimension, bias=False, is_causal=True, dropout=0.1).to("cuda").to(dtype).eval() -print(model) - -``` - - - - - - -``` -CausalSelfAttention( - (c_attn): Linear(in_features=512, out_features=1536, bias=False) - (c_proj): Linear(in_features=512, out_features=512, bias=False) - (resid_dropout): Dropout(p=0.1, inplace=False) -) - -``` - - - - -### `NestedTensor` - 和密集tensor支持 [¶](#nestedtensor-and-dense-tensor-support "永久链接到此标题") - - - - SDPA 支持 - `NestedTensor` - 和密集tensor输入。 - `NestedTensor` - 处理输入是一批可变长度序列的情况 -无需将每个序列填充到最大长度批。有关 - `NestedTensors` 的更多信息,请参阅 - [torch.nested](https://pytorch.org/docs/stable/nested.html) - 和 - [NestedTensors 教程](https://pytorch.org/tutorials/prototype/nestedtensor.html) -. 
- - - - - - -``` -import random -def generate_rand_batch( - batch_size, - max_sequence_len, - embed_dimension, - pad_percentage=None, - dtype=torch.float16, - device="cuda", -): - if not pad_percentage: - return ( - torch.randn( - batch_size, - max_sequence_len, - embed_dimension, - dtype=dtype, - device=device, - ), - None, - ) - # Random sequence lengths - seq_len_list = [ - int(max_sequence_len * (1 - random.gauss(pad_percentage, 0.01))) - for _ in range(batch_size) - ] - # Make random entry in the batch have max sequence length - seq_len_list[random.randint(0, batch_size - 1)] = max_sequence_len - return ( - torch.nested.nested_tensor( - - [torch.randn(seq_len, embed_dimension, - dtype=dtype, device=device) - for seq_len in seq_len_list - ] - ), - seq_len_list, - ) - -random_nt, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=0.5, dtype=dtype, device=device) -random_dense, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=None, dtype=dtype, device=device) - -# Currently the fused implementations don't support ``NestedTensor`` for training -model.eval() - -with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]): - try: - print(f"Random NT runs in {benchmark_torch_function_in_microseconds(model, random_nt):.3f} microseconds") - print(f"Random Dense runs in {benchmark_torch_function_in_microseconds(model, random_dense):.3f} microseconds") - except RuntimeError: - print("FlashAttention is not supported. See warnings for reasons.") - -``` - - - - - - -``` -/var/lib/jenkins/workspace/intermediate_source/scaled_dot_product_attention_tutorial.py:226: UserWarning: - -The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.) - -Random NT runs in 679.281 microseconds -Random Dense runs in 1183.933 microseconds - -``` - - - - - - - - -# 使用 SDPA 与 - `torch.compile` [¶](#using-sdpa-with-torch-compile "永久链接到此标题") - - - - 随着 PyTorch 2.0 的发布,引入了一项名为 - `torch.compile()` - 的新功能,与 eager 模式相比 -它可以提供 -显着的性能改进。 -缩放点积注意力完全可以与 -组合`torch.compile()` - 。 -为了演示这一点,让’s 使用 - `CausalSelfAttention` - 模块编译 - `torch.compile()` - 并观察由此产生的性能改进. 
- - - - - - -``` -batch_size = 32 -max_sequence_len = 256 -x = torch.rand(batch_size, max_sequence_len, - embed_dimension, device=device, dtype=dtype) -print( - f"The non compiled module runs in {benchmark_torch_function_in_microseconds(model, x):.3f} microseconds") - - -compiled_model = torch.compile(model) -# Let's compile it -compiled_model(x) -print( - f"The compiled module runs in {benchmark_torch_function_in_microseconds(compiled_model, x):.3f} microseconds") - -``` - - - - - - -``` -The non compiled module runs in 416.696 microseconds -The compiled module runs in 453.513 microseconds - -``` - - - - - 确切的执行时间取决于机器,但是我的结果: -未编译的模块在 166.616 微秒内运行 -编译的模块在 166.726 微秒内运行 -这不是我们所期望的。让’s 更深入地挖掘一下。 -PyTorch 附带了一个令人惊叹的内置分析器,您可以使用它 -检查代码的性能特征。 - - - - - - -``` -from torch.profiler import profile, record_function, ProfilerActivity -activities = [ProfilerActivity.CPU] -if device == 'cuda': - activities.append(ProfilerActivity.CUDA) - -with profile(activities=activities, record_shapes=False) as prof: - with record_function(" Non-Compilied Causal Attention"): - for _ in range(25): - model(x) -print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) - - -with profile(activities=activities, record_shapes=False) as prof: - with record_function("Compiled Causal Attention"): - for _ in range(25): - compiled_model(x) -print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) - -# For even more insights, you can export the trace and use ``chrome://tracing`` to view the results -# :: -# -# prof.export_chrome_trace("compiled_causal_attention_trace.json"). - -``` - - - - - - -``` -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Non-Compilied Causal Attention 16.91% 1.981ms 70.42% 8.250ms 8.250ms 0.000us 0.00% 11.013ms 11.013ms 1 - aten::matmul 2.48% 291.000us 26.92% 3.154ms 63.080us 0.000us 0.00% 8.378ms 167.560us 50 - aten::mm 18.89% 2.213ms 22.68% 2.657ms 53.140us 7.743ms 74.61% 8.378ms 167.560us 50 - aten::linear 2.50% 293.000us 30.21% 3.539ms 70.780us 0.000us 0.00% 7.893ms 157.860us 50 - ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn 0.00% 0.000us 0.00% 0.000us 0.000us 5.550ms 53.48% 5.550ms 222.000us 25 - aten::scaled_dot_product_attention 1.85% 217.000us 14.66% 1.718ms 68.720us 0.000us 0.00% 2.635ms 105.400us 25 - aten::_scaled_dot_product_efficient_attention 3.61% 423.000us 12.81% 1.501ms 60.040us 0.000us 0.00% 2.635ms 105.400us 25 - aten::_efficient_attention_forward 3.36% 394.000us 8.33% 976.000us 39.040us 2.635ms 25.39% 2.635ms 105.400us 25 -fmha_cutlassF_f16_aligned_64x64_rf_sm80(PyTorchMemEf... 0.00% 0.000us 0.00% 0.000us 0.000us 2.635ms 25.39% 2.635ms 105.400us 25 -ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_3... 
0.00% 0.000us 0.00% 0.000us 0.000us 2.193ms 21.13% 2.193ms 87.720us 25 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 11.715ms -Self CUDA time total: 10.378ms - -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Compiled Causal Attention 14.58% 1.889ms 90.02% 11.660ms 11.660ms 0.000us 0.00% 12.187ms 12.187ms 1 - CompiledFunction 37.96% 4.916ms 66.21% 8.575ms 343.000us 0.000us 0.00% 12.187ms 487.480us 25 - aten::mm 6.82% 883.000us 10.76% 1.393ms 27.860us 7.767ms 68.85% 8.306ms 166.120us 50 - ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn 0.00% 0.000us 0.00% 0.000us 0.000us 5.572ms 49.39% 5.572ms 222.880us 25 - aten::_scaled_dot_product_efficient_attention 2.01% 260.000us 10.57% 1.369ms 54.760us 0.000us 0.00% 2.867ms 114.680us 25 - aten::_efficient_attention_forward 3.08% 399.000us 7.42% 961.000us 38.440us 2.639ms 23.39% 2.867ms 114.680us 25 -fmha_cutlassF_f16_aligned_64x64_rf_sm80(PyTorchMemEf... 0.00% 0.000us 0.00% 0.000us 0.000us 2.639ms 23.39% 2.639ms 105.560us 25 -ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_3... 0.00% 0.000us 0.00% 0.000us 0.000us 2.195ms 19.46% 2.195ms 87.800us 25 - triton_poi_fused_clone_0 2.84% 368.000us 3.92% 508.000us 20.320us 875.000us 7.76% 1.014ms 40.560us 25 - triton__0d1de 0.00% 0.000us 0.00% 0.000us 0.000us 875.000us 7.76% 875.000us 35.000us 25 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 12.952ms -Self CUDA time total: 11.281ms - -``` - - - - - 前面的代码片段生成了编译模块和非编译模块中消耗最多 GPU 执行时间的前 10 个 PyTorch 函数的报告。 -分析表明,花费在 GPU 上的大部分时间是两个模块集中 -相同的函数集。 -原因是 - `torch.compile` -非常擅长消除 -与 PyTorch 相关的框架开销。如果您的模型正在启动大型、高效的 CUDA 内核(在本例中就是“CausalSelfAttention”),则可以隐藏 PyTorch 的开销。 - - - - - 实际上,您的模块通常不包含单个 - `CausalSelfAttention` - 块。在使用 [Andrej Karpathy NanoGPT](https://github.com/karpathy/nanoGPT) 存储库进行实验时,编译 -模块每个训练步骤的时间从: - `6090.49ms` - 到 - `3273.17ms` -!这是在 Shakespeare 数据集上的 NanoGPT 训练提交时完成的: - `ae3a8d5` -。 - - - - - -# 结论 [¶](#conclusion "永久链接到此标题") - - - - 在本教程中,我们演示了 - `torch.nn.function.scaled_dot_product_attention` - 的基本用法。我们已经展示了如何使用 -`sdp_kernel` - 上下文管理器来断言在 GPU 上使用了某个 -实现。此外,我们还构建了一个简单的“CausalSelfAttention”模块,该模块可与“NestedTensor”配合使用,并且可进行 torch 编译。在此过程中,我们展示了如何使用分析工具 -来探索用户定义 -模块的性能特征。 - - - - -**脚本的总运行时间:** - ( 0 分 8.239 秒) diff --git a/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#using-sdpa-with-attn-bias-subclasses.md b/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#using-sdpa-with-attn-bias-subclasses.md deleted file mode 100755 index 1132bfe71..000000000 --- a/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#using-sdpa-with-attn-bias-subclasses.md +++ /dev/null @@ -1,417 +0,0 @@ -# (测试版)使用缩放点积注意力(SDPA)实现高性能Transformers [¶](#beta-implementing-high-performance-transformers-with-scaled-dot-product-attention- sdpa"此标题的永久链接") - -> 
译者:[liuenci](https://github.com/liuenci) -> -> 项目地址: -> -> 原始地址: - -**作者**: [Driss Guessous](https://github.com/drisspg) - -# 摘要 [¶](#summary "此标题的永久链接") -在本教程中,我们将介绍一个新的torch.nn.functional函数,它对于实现 Transformers 架构非常有帮助。这个函数名为torch.nn.functional.scaled_dot_product_attention。有关该函数的详细描述,请参阅[PyTorch 文档](https://pytorch.org/docs/master/generated/torch.nn.function.scaled_dot_product_attention.html#torch.nn.function.scaled_dot_product_attention) 。此函数已经被整合到torch.nn.MultiheadAttention和torch.nn.TransformerEncoderLayer中。 - -# 概述 [¶](#overview "此标题的永久链接") -从深层次来看,这个PyTorch函数根据论文《Attention is all you need》中的定义,计算查询(query)、键(key)和值(value)之间的缩放点积注意力(SDPA)。虽然这个函数可以使用现有的PyTorch函数编写,但一个融合实现(fused implementation)可以比朴素实现提供更大的性能优势。 - -# 融合实现 [¶](#fused-implementations "永久链接到此标题") -对于CUDA张量输入,该函数将分派到以下实现之一: -1. **FlashAttention**:这是一种快速且内存高效的精确注意力机制,具有IO感知能力。这种实现优化了计算速度,并考虑到输入/输出操作对性能的影响。 -2. **内存高效注意力**:这种实现旨在减少在执行缩放点积注意力时所需的内存占用,这对于处理大型模型或长序列尤为重要。 -3. **C++中定义的PyTorch实现**:这指的是在C++中编写的PyTorch函数实现,通常用于提高性能,因为C++编写的代码可以直接与底层硬件进行交互,从而优化计算效率。 - -``` -本教程需要PyTorch 2.0.0或更高版本。 -``` - -```py -import torch -import torch.nn as nn -import torch.nn.functional as F -device = "cuda" if torch.cuda.is_available() else "cpu" - -# Example Usage: -query, key, value = torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device) -F.scaled_dot_product_attention(query, key, value) -``` - -``` -tensor([[[-1.3321, -0.3489, 0.3015, -0.3912, 0.9867, 0.3137, -0.0691, - -1.2593], - [-1.0882, 0.2506, 0.6491, 0.1360, 0.5238, -0.2448, -0.0820, - -0.6171], - [-1.0012, 0.3990, 0.6441, -0.0277, 0.5325, -0.2564, -0.0607, - -0.6404]], - - [[ 0.6091, 0.0708, 0.6188, 0.3252, -0.1598, 0.4197, -0.2335, - 0.0630], - [ 0.5285, 0.3890, -0.2649, 0.3706, -0.3839, 0.1963, -0.6242, - 0.2312], - [ 0.4048, 0.0762, 0.3777, 0.4689, -0.2978, 0.2754, -0.6429, - 0.1037]]], device='cuda:0') -``` - -# 显式调度器控制 [¶](#explicit-dispatcher-control "永久链接到此标题") -虽然该函数会隐式地分派到三种实现之一,但用户也可以通过使用上下文管理器(context manager)来显式控制分派。这个上下文管理器允许用户显式禁用某些实现。如果用户想确保函数确实针对他们的特定输入使用最快的实现,可以使用上下文管理器来遍历并测量性能。 -```py -# Lets define a helpful benchmarking function: -import torch.utils.benchmark as benchmark -def benchmark_torch_function_in_microseconds(f, *args, **kwargs): - t0 = benchmark.Timer( - stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f} - ) - return t0.blocked_autorange().mean * 1e6 - -# Lets define the hyper-parameters of our input -batch_size = 32 -max_sequence_len = 1024 -num_heads = 32 -embed_dimension = 32 - -dtype = torch.float16 - -query = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) -key = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) -value = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) - -print(f"The default implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - -# Lets explore the speed of each of the 3 implementations -from torch.nn.attention import SDPBackend, sdpa_kernel - - -with sdpa_kernel(SDPBackend.MATH): - math_time=benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value) - print(f"The math implementation runs in {math_time:.3f} microseconds") - -with sdpa_kernel(SDPBackend.FLASH_ATTENTION): - try: - flash_time=benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value) - print(f"The flash 
attention implementation runs in {flash_time:.3f} microseconds") - except RuntimeError: - print("FlashAttention is not supported. See warnings for reasons.") - -with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION): - try: - efficient_time=benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value) - print(f"The memory efficient implementation runs in {efficient_time:.3f} microseconds") - except RuntimeError: - print("EfficientAttention is not supported. See warnings for reasons.") -``` - - -``` -The default implementation runs in 2304.977 microseconds -The math implementation runs in 19249.369 microseconds -The flash attention implementation runs in 2304.600 microseconds -The memory efficient implementation runs in 4197.082 microseconds -``` - -# 硬件依赖性 [¶](#hardware-dependence "永久链接到此标题") -根据您在上面代码单元运行的机器以及可用的硬件,您得到的结果可能会有所不同: -- 如果您没有GPU并且是在CPU上运行,那么上下文管理器将不起作用,三次运行应该返回相似的时间。 -- 根据您的显卡支持的计算能力,FlashAttention或内存高效注意力可能会失败。 - -# 因果自注意力[¶](#causal-self-attention "永久链接到此标题") -下面是一个因果自注意力(multi-headed causal self attention)块的示例实现,灵感来源于Andrej Karpathy的NanoGPT仓库。 - -```py -class CausalSelfAttention(nn.Module): - - def __init__(self, num_heads: int, embed_dimension: int, bias: bool=False, is_causal: bool=False, dropout:float=0.0): - super().__init__() - assert embed_dimension % num_heads == 0 - # key, query, value projections for all heads, but in a batch - self.c_attn = nn.Linear(embed_dimension, 3 * embed_dimension, bias=bias) - # output projection - self.c_proj = nn.Linear(embed_dimension, embed_dimension, bias=bias) - # regularization - self.dropout = dropout - self.resid_dropout = nn.Dropout(dropout) - self.num_heads = num_heads - self.embed_dimension = embed_dimension - # Perform causal masking - self.is_causal = is_causal - - def forward(self, x): - # calculate query, key, values for all heads in batch and move head forward to be the batch dim - query_projected = self.c_attn(x) - - batch_size = query_projected.size(0) - embed_dim = query_projected.size(2) - head_dim = embed_dim // (self.num_heads * 3) - - query, key, value = query_projected.chunk(3, -1) - query = query.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - key = key.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - value = value.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - - if self.training: - dropout = self.dropout - is_causal = self.is_causal - else: - dropout = 0.0 - is_causal = False - - y = F.scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=dropout, is_causal=is_causal) - y = y.transpose(1, 2).view(batch_size, -1, self.num_heads * head_dim) - - y = self.resid_dropout(self.c_proj(y)) - return y - - -num_heads = 8 -heads_per_dim = 64 -embed_dimension = num_heads * heads_per_dim -dtype = torch.float16 -model = CausalSelfAttention(num_heads=num_heads, embed_dimension=embed_dimension, bias=False, is_causal=True, dropout=0.1).to("cuda").to(dtype).eval() -print(model) -``` - - -``` -CausalSelfAttention( - (c_attn): Linear(in_features=512, out_features=1536, bias=False) - (c_proj): Linear(in_features=512, out_features=512, bias=False) - (resid_dropout): Dropout(p=0.1, inplace=False) -) -``` - -# NestedTensor 和 Dense 张量支持 -SDPA支持NestedTensor和Dense张量输入。NestedTensors处理的情况是输入是一个不等长序列的批次,而无需将每个序列填充到批次中的最大长度。有关NestedTensors的更多信息,请参阅torch.nested和NestedTensors教程。 - -```py -import random -def generate_rand_batch( - batch_size, - max_sequence_len, - embed_dimension, - pad_percentage=None, - dtype=torch.float16, - device="cuda", -): - if 
not pad_percentage: - return ( - torch.randn( - batch_size, - max_sequence_len, - embed_dimension, - dtype=dtype, - device=device, - ), - None, - ) - # Random sequence lengths - seq_len_list = [ - int(max_sequence_len * (1 - random.gauss(pad_percentage, 0.01))) - for _ in range(batch_size) - ] - # Make random entry in the batch have max sequence length - seq_len_list[random.randint(0, batch_size - 1)] = max_sequence_len - return ( - torch.nested.nested_tensor( - [ - torch.randn(seq_len, embed_dimension, - dtype=dtype, device=device) - for seq_len in seq_len_list - ] - ), - seq_len_list, - ) - -random_nt, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=0.5, dtype=dtype, device=device) -random_dense, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=None, dtype=dtype, device=device) - -# Currently the fused implementations don't support ``NestedTensor`` for training -model.eval() - -with sdpa_kernel(SDPBackend.FLASH_ATTENTION): - try: - print(f"Random NT runs in {benchmark_torch_function_in_microseconds(model, random_nt):.3f} microseconds") - print(f"Random Dense runs in {benchmark_torch_function_in_microseconds(model, random_dense):.3f} microseconds") - except RuntimeError: - print("FlashAttention is not supported. See warnings for reasons.") -``` - - -``` -/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py:166: UserWarning: - -The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.) - -Random NT runs in 558.517 microseconds -Random Dense runs in 936.630 microseconds -``` - -# 使用 torch.compile 与 SDPA [¶](#using-sdpa-with-torch-compile "永久链接到此标题") -随着PyTorch 2.0的发布,引入了一个名为torch.compile()的新特性,它可以在急切模式(eager mode)上提供显著性能提升。缩放点积注意力(SDPA)与torch.compile()完全兼容。为了演示这一点,我们将使用torch.compile()编译CausalSelfAttention模块,并观察由此带来的性能提升。 - -``` -batch_size = 32 -max_sequence_len = 256 -x = torch.rand(batch_size, max_sequence_len, - embed_dimension, device=device, dtype=dtype) -print( - f"The non compiled module runs in {benchmark_torch_function_in_microseconds(model, x):.3f} microseconds") - - -compiled_model = torch.compile(model) -# Let's compile it -compiled_model(x) -print( - f"The compiled module runs in {benchmark_torch_function_in_microseconds(compiled_model, x):.3f} microseconds") -``` - - -``` -The non compiled module runs in 408.207 microseconds -The compiled module runs in 516.612 microseconds -``` - -具体的执行时间取决于机器,但我的结果是:未编译的模块运行时间为166.616微秒,编译后的模块运行时间为166.726微秒。这并不是我们期望的结果。让我们深入探究一下。PyTorch内置了一个惊人的性能分析器(profiler),您可以使用它来检查代码的性能特征。 - -```py -from torch.profiler import profile, record_function, ProfilerActivity -activities = [ProfilerActivity.CPU] -if device == 'cuda': - activities.append(ProfilerActivity.CUDA) - -with profile(activities=activities, record_shapes=False) as prof: - with record_function(" Non-Compilied Causal Attention"): - for _ in range(25): - model(x) -print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) - - -with profile(activities=activities, record_shapes=False) as prof: - with record_function("Compiled Causal Attention"): - for _ in range(25): - compiled_model(x) -print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) - -# For even more insights, you can export the trace and use ``chrome://tracing`` to view the results -# -# .. code-block:: python -# -# prof.export_chrome_trace("compiled_causal_attention_trace.json"). 
-``` - - -``` -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Non-Compilied Causal Attention 20.01% 2.285ms 77.24% 8.821ms 8.821ms 0.000us 0.00% 11.098ms 11.098ms 1 - Non-Compilied Causal Attention 0.00% 0.000us 0.00% 0.000us 0.000us 10.328ms 50.41% 10.328ms 10.328ms 1 - aten::matmul 2.36% 269.000us 27.28% 3.115ms 62.300us 0.000us 0.00% 8.156ms 163.120us 50 - aten::mm 18.72% 2.138ms 22.97% 2.623ms 52.460us 7.750ms 37.83% 8.156ms 163.120us 50 - aten::linear 1.62% 185.000us 30.99% 3.539ms 70.780us 0.000us 0.00% 8.068ms 161.360us 50 - ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn 0.00% 0.000us 0.00% 0.000us 0.000us 5.552ms 27.10% 5.552ms 222.080us 25 - aten::scaled_dot_product_attention 1.97% 225.000us 17.75% 2.027ms 81.080us 0.000us 0.00% 2.942ms 117.680us 25 - aten::_scaled_dot_product_flash_attention 3.38% 386.000us 15.78% 1.802ms 72.080us 0.000us 0.00% 2.942ms 117.680us 25 - aten::_flash_attention_forward 4.45% 508.000us 11.48% 1.311ms 52.440us 2.411ms 11.77% 2.942ms 117.680us 25 -void pytorch_flash::flash_fwd_kernel - -tensor([[ True, False, False, False, False, False, False, False, False, False], - [ True, True, False, False, False, False, False, False, False, False]]) -tensor([[ True, True, True, True, True, True, True, True, True, False], - [ True, True, True, True, True, True, True, True, True, True]]) -``` - -# 结论 -在本教程中,我们演示了torch.nn.functional.scaled_dot_product_attention的基本用法。我们展示了如何使用sdpa_kernel上下文管理器来确保在GPU上使用特定的实现。此外,我们还构建了一个简单的CausalSelfAttention模块,该模块与NestedTensor兼容,并且可以被torch编译。在这个过程中,我们还展示了如何使用性能分析工具来探索用户定义模块的性能特征。 - -脚本总运行时间:(0分钟7.894秒) \ No newline at end of file diff --git a/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#using-sdpa-with-torch-compile.md b/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#using-sdpa-with-torch-compile.md deleted file mode 100755 index ac8ca378a..000000000 --- a/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#using-sdpa-with-torch-compile.md +++ /dev/null @@ -1,592 +0,0 @@ -# (测试版)通过缩放点积注意力 (SDPA) 实现高性能 Transformer [¶](#beta-implementing-high-performance-transformers-with-scaled-dot-product-attention- sdpa"此标题的永久链接") - - -> 译者:[片刻小哥哥](https://github.com/jiangzhonglian) -> -> 项目地址: -> -> 原始地址: - - - - -**作者:** -[Driss Guessous](https://github.com/drisspg) - - - - - -## 摘要 [¶](#summary "此标题的永久链接") - - - - - 在本教程中,我们想要重点介绍一个新的 - `torch.nn.function` - 函数,它有助于实现 Transformer 架构。该函数名为 - `torch.nn.function.scaled_dot_product_attention` - 。 -有关该函数的详细说明,请参阅 - [PyTorch 文档](https://pytorch.org/docs/master/generated/torch.nn.function.scaled_dot_product_attention.html#torch.nn.function.scaled_dot_product_attention) -. -此函数已合并到 - `torch.nn.MultiheadAttention` - 和 - `torch.nn.TransformerEncoderLayer` -. 
- - - - - -## 概述 [¶](#overview "此标题的永久链接") - - - - - 在较高层面上,此 PyTorch 函数根据 -论文中的定义计算查询、键和值之间的 -缩放点积注意力 (SDPA) - [注意力就是您所需要的](https://arxiv.org/abs/1706.03762) - 。虽然可以使用现有函数在 PyTorch 中编写此函数,但融合实现可以比原始实现提供更大的性能优势。 - - - - - -## 融合实现 [¶](#fused-implementations "永久链接到此标题") - - - - - 对于 CUDA tensor输入,该函数将分派到以下实现之一 -: - - - -* [FlashAttention:具有 IO 感知的快速、内存高效的精确注意力](https://arxiv.org/abs/2205.14135) -* [内存高效的注意力](https://github.com/facebookresearch/xformers ) -* 用 C++ 定义的 PyTorch 实现 - - - - - 注意 - - - - - 本教程需要 PyTorch 2.0.0 或更高版本。 - - - - - - - -``` -import torch -import torch.nn as nn -import torch.nn.functional as F -device = "cuda" if torch.cuda.is_available() else "cpu" - -# Example Usage: -query, key, value = torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device) -F.scaled_dot_product_attention(query, key, value) - -``` - - - - - - -``` -tensor([[[-1.3321, -0.3489, 0.3015, -0.3912, 0.9867, 0.3137, -0.0691, - -1.2593], - [-1.0882, 0.2506, 0.6491, 0.1360, 0.5238, -0.2448, -0.0820, - -0.6171], - [-1.0012, 0.3990, 0.6441, -0.0277, 0.5325, -0.2564, -0.0607, - -0.6404]], - - [[ 0.6091, 0.0708, 0.6188, 0.3252, -0.1598, 0.4197, -0.2335, - 0.0630], - [ 0.5285, 0.3890, -0.2649, 0.3706, -0.3839, 0.1963, -0.6242, - 0.2312], - [ 0.4048, 0.0762, 0.3777, 0.4689, -0.2978, 0.2754, -0.6429, - 0.1037]]], device='cuda:0') - -``` - - - - - -## 显式调度程序控制 [¶](#explicit-dispatcher-control "永久链接到此标题") - - - - - 虽然该函数将隐式分派到三个 -实现之一,但用户还可以通过使用上下文管理器 -显式控制分派。此上下文管理器允许用户 -显式禁用某些实现。如果用户想要确保 -该函数确实对其特定输入使用 -最快的实现, -可以使用上下文管理器来扫描 -测量性能。 - - - - - - -``` -# Lets define a helpful benchmarking function: -import torch.utils.benchmark as benchmark -def benchmark_torch_function_in_microseconds(f, *args, **kwargs): - t0 = benchmark.Timer( - stmt="f(*args, **kwargs)", globals={"args": args, "kwargs": kwargs, "f": f} - ) - return t0.blocked_autorange().mean * 1e6 - -# Lets define the hyper-parameters of our input -batch_size = 32 -max_sequence_len = 1024 -num_heads = 32 -embed_dimension = 32 - -dtype = torch.float16 - -query = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) -key = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) -value = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, device=device, dtype=dtype) - -print(f"The default implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - -# Lets explore the speed of each of the 3 implementations -from torch.backends.cuda import sdp_kernel, SDPBackend - -# Helpful arguments mapper -backend_map = { - SDPBackend.MATH: {"enable_math": True, "enable_flash": False, "enable_mem_efficient": False}, - SDPBackend.FLASH_ATTENTION: {"enable_math": False, "enable_flash": True, "enable_mem_efficient": False}, - SDPBackend.EFFICIENT_ATTENTION: { - "enable_math": False, "enable_flash": False, "enable_mem_efficient": True} -} - -with sdp_kernel(**backend_map[SDPBackend.MATH]): - print(f"The math implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - - -with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]): - try: - print(f"The flash attention implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - except RuntimeError: - print("FlashAttention is not supported. 
See warnings for reasons.") - -with sdp_kernel(**backend_map[SDPBackend.EFFICIENT_ATTENTION]): - try: - print(f"The memory efficient implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") - except RuntimeError: - print("EfficientAttention is not supported. See warnings for reasons.") - -``` - - - - - - -``` -The default implementation runs in 4741.745 microseconds -The math implementation runs in 19249.446 microseconds -The flash attention implementation runs in 4741.583 microseconds -The memory efficient implementation runs in 4193.383 microseconds - -``` - - - - - -## 硬件依赖性 [¶](#hardware-dependence "永久链接到此标题") - - - - - 根据您运行上述单元的机器以及可用的硬件,您的结果可能会有所不同。 -- 如果您没有’ 没有 GPU 并且在 CPU 上运行,则上下文管理器\ n 将没有任何效果,并且所有三个运行都应返回相似的计时。 -- 取决于您的显卡支持的计算能力 -闪存关注或内存效率可能会失败。 - - - - - -## 因果自注意力 [¶](#causal-self-attention "永久链接到此标题") - - - - - 下面是一个多头因果自我注意力块的示例实现,灵感来自于 - [Andrej Karpathy NanoGPT](https://github.com/karpathy/nanoGPT) - 存储库。 - - - - - - -``` -class CausalSelfAttention(nn.Module): - - def __init__(self, num_heads: int, embed_dimension: int, bias: bool=False, is_causal: bool=False, dropout:float=0.0): - super().__init__() - assert embed_dimension % num_heads == 0 - # key, query, value projections for all heads, but in a batch - self.c_attn = nn.Linear(embed_dimension, 3 * embed_dimension, bias=bias) - # output projection - self.c_proj = nn.Linear(embed_dimension, embed_dimension, bias=bias) - # regularization - self.dropout = dropout - self.resid_dropout = nn.Dropout(dropout) - self.num_heads = num_heads - self.embed_dimension = embed_dimension - # Perform causal masking - self.is_causal = is_causal - - def forward(self, x): - # calculate query, key, values for all heads in batch and move head forward to be the batch dim - query_projected = self.c_attn(x) - - batch_size = query_projected.size(0) - embed_dim = query_projected.size(2) - head_dim = embed_dim // (self.num_heads * 3) - - query, key, value = query_projected.chunk(3, -1) - query = query.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - key = key.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - value = value.view(batch_size, -1, self.num_heads, head_dim).transpose(1, 2) - - if self.training: - dropout = self.dropout - is_causal = self.is_causal - else: - dropout = 0.0 - is_causal = False - - y = F.scaled_dot_product_attention(query, key, value, attn_mask=None, dropout_p=dropout, is_causal=is_causal) - y = y.transpose(1, 2).view(batch_size, -1, self.num_heads * head_dim) - - y = self.resid_dropout(self.c_proj(y)) - return y - - -num_heads = 8 -heads_per_dim = 64 -embed_dimension = num_heads * heads_per_dim -dtype = torch.float16 -model = CausalSelfAttention(num_heads=num_heads, embed_dimension=embed_dimension, bias=False, is_causal=True, dropout=0.1).to("cuda").to(dtype).eval() -print(model) - -``` - - - - - - -``` -CausalSelfAttention( - (c_attn): Linear(in_features=512, out_features=1536, bias=False) - (c_proj): Linear(in_features=512, out_features=512, bias=False) - (resid_dropout): Dropout(p=0.1, inplace=False) -) - -``` - - - - -### `NestedTensor` - 和密集tensor支持 [¶](#nestedtensor-and-dense-tensor-support "永久链接到此标题") - - - - SDPA 支持 - `NestedTensor` - 和密集tensor输入。 - `NestedTensor` - 处理输入是一批可变长度序列的情况 -无需将每个序列填充到最大长度批。有关 - `NestedTensors` 的更多信息,请参阅 - [torch.nested](https://pytorch.org/docs/stable/nested.html) - 和 - [NestedTensors 教程](https://pytorch.org/tutorials/prototype/nestedtensor.html) -. 
- - - - - - -``` -import random -def generate_rand_batch( - batch_size, - max_sequence_len, - embed_dimension, - pad_percentage=None, - dtype=torch.float16, - device="cuda", -): - if not pad_percentage: - return ( - torch.randn( - batch_size, - max_sequence_len, - embed_dimension, - dtype=dtype, - device=device, - ), - None, - ) - # Random sequence lengths - seq_len_list = [ - int(max_sequence_len * (1 - random.gauss(pad_percentage, 0.01))) - for _ in range(batch_size) - ] - # Make random entry in the batch have max sequence length - seq_len_list[random.randint(0, batch_size - 1)] = max_sequence_len - return ( - torch.nested.nested_tensor( - - [torch.randn(seq_len, embed_dimension, - dtype=dtype, device=device) - for seq_len in seq_len_list - ] - ), - seq_len_list, - ) - -random_nt, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=0.5, dtype=dtype, device=device) -random_dense, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=None, dtype=dtype, device=device) - -# Currently the fused implementations don't support ``NestedTensor`` for training -model.eval() - -with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]): - try: - print(f"Random NT runs in {benchmark_torch_function_in_microseconds(model, random_nt):.3f} microseconds") - print(f"Random Dense runs in {benchmark_torch_function_in_microseconds(model, random_dense):.3f} microseconds") - except RuntimeError: - print("FlashAttention is not supported. See warnings for reasons.") - -``` - - - - - - -``` -/var/lib/jenkins/workspace/intermediate_source/scaled_dot_product_attention_tutorial.py:226: UserWarning: - -The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.) - -Random NT runs in 679.281 microseconds -Random Dense runs in 1183.933 microseconds - -``` - - - - - - - - -# 使用 SDPA 与 - `torch.compile` [¶](#using-sdpa-with-torch-compile "永久链接到此标题") - - - - 随着 PyTorch 2.0 的发布,引入了一项名为 - `torch.compile()` - 的新功能,与 eager 模式相比 -它可以提供 -显着的性能改进。 -缩放点积注意力完全可以与 -组合`torch.compile()` - 。 -为了演示这一点,让’s 使用 - `CausalSelfAttention` - 模块编译 - `torch.compile()` - 并观察由此产生的性能改进. 
- - - - - - -``` -batch_size = 32 -max_sequence_len = 256 -x = torch.rand(batch_size, max_sequence_len, - embed_dimension, device=device, dtype=dtype) -print( - f"The non compiled module runs in {benchmark_torch_function_in_microseconds(model, x):.3f} microseconds") - - -compiled_model = torch.compile(model) -# Let's compile it -compiled_model(x) -print( - f"The compiled module runs in {benchmark_torch_function_in_microseconds(compiled_model, x):.3f} microseconds") - -``` - - - - - - -``` -The non compiled module runs in 416.696 microseconds -The compiled module runs in 453.513 microseconds - -``` - - - - - 确切的执行时间取决于机器,但是我的结果: -未编译的模块在 166.616 微秒内运行 -编译的模块在 166.726 微秒内运行 -这不是我们所期望的。让’s 更深入地挖掘一下。 -PyTorch 附带了一个令人惊叹的内置分析器,您可以使用它 -检查代码的性能特征。 - - - - - - -``` -from torch.profiler import profile, record_function, ProfilerActivity -activities = [ProfilerActivity.CPU] -if device == 'cuda': - activities.append(ProfilerActivity.CUDA) - -with profile(activities=activities, record_shapes=False) as prof: - with record_function(" Non-Compilied Causal Attention"): - for _ in range(25): - model(x) -print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) - - -with profile(activities=activities, record_shapes=False) as prof: - with record_function("Compiled Causal Attention"): - for _ in range(25): - compiled_model(x) -print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) - -# For even more insights, you can export the trace and use ``chrome://tracing`` to view the results -# :: -# -# prof.export_chrome_trace("compiled_causal_attention_trace.json"). - -``` - - - - - - -``` -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Non-Compilied Causal Attention 16.91% 1.981ms 70.42% 8.250ms 8.250ms 0.000us 0.00% 11.013ms 11.013ms 1 - aten::matmul 2.48% 291.000us 26.92% 3.154ms 63.080us 0.000us 0.00% 8.378ms 167.560us 50 - aten::mm 18.89% 2.213ms 22.68% 2.657ms 53.140us 7.743ms 74.61% 8.378ms 167.560us 50 - aten::linear 2.50% 293.000us 30.21% 3.539ms 70.780us 0.000us 0.00% 7.893ms 157.860us 50 - ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn 0.00% 0.000us 0.00% 0.000us 0.000us 5.550ms 53.48% 5.550ms 222.000us 25 - aten::scaled_dot_product_attention 1.85% 217.000us 14.66% 1.718ms 68.720us 0.000us 0.00% 2.635ms 105.400us 25 - aten::_scaled_dot_product_efficient_attention 3.61% 423.000us 12.81% 1.501ms 60.040us 0.000us 0.00% 2.635ms 105.400us 25 - aten::_efficient_attention_forward 3.36% 394.000us 8.33% 976.000us 39.040us 2.635ms 25.39% 2.635ms 105.400us 25 -fmha_cutlassF_f16_aligned_64x64_rf_sm80(PyTorchMemEf... 0.00% 0.000us 0.00% 0.000us 0.000us 2.635ms 25.39% 2.635ms 105.400us 25 -ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_3... 
0.00% 0.000us 0.00% 0.000us 0.000us 2.193ms 21.13% 2.193ms 87.720us 25 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 11.715ms -Self CUDA time total: 10.378ms - -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ - Compiled Causal Attention 14.58% 1.889ms 90.02% 11.660ms 11.660ms 0.000us 0.00% 12.187ms 12.187ms 1 - CompiledFunction 37.96% 4.916ms 66.21% 8.575ms 343.000us 0.000us 0.00% 12.187ms 487.480us 25 - aten::mm 6.82% 883.000us 10.76% 1.393ms 27.860us 7.767ms 68.85% 8.306ms 166.120us 50 - ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn 0.00% 0.000us 0.00% 0.000us 0.000us 5.572ms 49.39% 5.572ms 222.880us 25 - aten::_scaled_dot_product_efficient_attention 2.01% 260.000us 10.57% 1.369ms 54.760us 0.000us 0.00% 2.867ms 114.680us 25 - aten::_efficient_attention_forward 3.08% 399.000us 7.42% 961.000us 38.440us 2.639ms 23.39% 2.867ms 114.680us 25 -fmha_cutlassF_f16_aligned_64x64_rf_sm80(PyTorchMemEf... 0.00% 0.000us 0.00% 0.000us 0.000us 2.639ms 23.39% 2.639ms 105.560us 25 -ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_3... 0.00% 0.000us 0.00% 0.000us 0.000us 2.195ms 19.46% 2.195ms 87.800us 25 - triton_poi_fused_clone_0 2.84% 368.000us 3.92% 508.000us 20.320us 875.000us 7.76% 1.014ms 40.560us 25 - triton__0d1de 0.00% 0.000us 0.00% 0.000us 0.000us 875.000us 7.76% 875.000us 35.000us 25 -------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ -Self CPU time total: 12.952ms -Self CUDA time total: 11.281ms - -``` - - - - - 前面的代码片段生成了编译模块和非编译模块中消耗最多 GPU 执行时间的前 10 个 PyTorch 函数的报告。 -分析表明,花费在 GPU 上的大部分时间是两个模块集中 -相同的函数集。 -原因是 - `torch.compile` -非常擅长消除 -与 PyTorch 相关的框架开销。如果您的模型正在启动大型、高效的 CUDA 内核(在本例中就是“CausalSelfAttention”),则可以隐藏 PyTorch 的开销。 - - - - - 实际上,您的模块通常不包含单个 - `CausalSelfAttention` - 块。在使用 [Andrej Karpathy NanoGPT](https://github.com/karpathy/nanoGPT) 存储库进行实验时,编译 -模块每个训练步骤的时间从: - `6090.49ms` - 到 - `3273.17ms` -!这是在 Shakespeare 数据集上的 NanoGPT 训练提交时完成的: - `ae3a8d5` -。 - - - - - -# 结论 [¶](#conclusion "永久链接到此标题") - - - - 在本教程中,我们演示了 - `torch.nn.function.scaled_dot_product_attention` - 的基本用法。我们已经展示了如何使用 -`sdp_kernel` - 上下文管理器来断言在 GPU 上使用了某个 -实现。此外,我们还构建了一个简单的“CausalSelfAttention”模块,该模块可与“NestedTensor”配合使用,并且可进行 torch 编译。在此过程中,我们展示了如何使用分析工具 -来探索用户定义 -模块的性能特征。 - - - - -**脚本的总运行时间:** - ( 0 分 8.239 秒) diff --git a/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial.md b/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial.md index 1c9814ff9..6c4504953 100755 --- a/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial.md +++ b/docs/2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial.md @@ -1,91 +1,30 @@ -# (测试版)通过缩放点积注意力 (SDPA) 实现高性能 Transformer [¶](#beta-implementing-high-performance-transformers-with-scaled-dot-product-attention- sdpa"此标题的永久链接") +# (测试版)使用缩放点积注意力(SDPA)实现高性能Transformers 
[¶](#beta-implementing-high-performance-transformers-with-scaled-dot-product-attention- sdpa"此标题的永久链接") - -> 译者:[片刻小哥哥](https://github.com/jiangzhonglian) +> 译者:[liuenci](https://github.com/liuenci) > > 项目地址: > > 原始地址: - - - -**作者:** -[Driss Guessous](https://github.com/drisspg) - - - - +**作者**: [Driss Guessous](https://github.com/drisspg) ## 摘要 [¶](#summary "此标题的永久链接") - - - - - 在本教程中,我们想要重点介绍一个新的 - `torch.nn.function` - 函数,它有助于实现 Transformer 架构。该函数名为 - `torch.nn.function.scaled_dot_product_attention` - 。 -有关该函数的详细说明,请参阅 - [PyTorch 文档](https://pytorch.org/docs/master/generated/torch.nn.function.scaled_dot_product_attention.html#torch.nn.function.scaled_dot_product_attention) -. -此函数已合并到 - `torch.nn.MultiheadAttention` - 和 - `torch.nn.TransformerEncoderLayer` -. - - - - +在本教程中,我们将介绍一个新的torch.nn.functional函数,它对于实现 Transformers 架构非常有帮助。这个函数名为torch.nn.functional.scaled_dot_product_attention。有关该函数的详细描述,请参阅[PyTorch 文档](https://pytorch.org/docs/master/generated/torch.nn.function.scaled_dot_product_attention.html#torch.nn.function.scaled_dot_product_attention) 。此函数已经被整合到torch.nn.MultiheadAttention和torch.nn.TransformerEncoderLayer中。 ## 概述 [¶](#overview "此标题的永久链接") - - - - - 在较高层面上,此 PyTorch 函数根据 -论文中的定义计算查询、键和值之间的 -缩放点积注意力 (SDPA) - [注意力就是您所需要的](https://arxiv.org/abs/1706.03762) - 。虽然可以使用现有函数在 PyTorch 中编写此函数,但融合实现可以比原始实现提供更大的性能优势。 - - - - +从深层次来看,这个PyTorch函数根据论文《Attention is all you need》中的定义,计算查询(query)、键(key)和值(value)之间的缩放点积注意力(SDPA)。虽然这个函数可以使用现有的PyTorch函数编写,但一个融合实现(fused implementation)可以比朴素实现提供更大的性能优势。 ## 融合实现 [¶](#fused-implementations "永久链接到此标题") +对于CUDA张量输入,该函数将分派到以下实现之一: +1. **FlashAttention**:这是一种快速且内存高效的精确注意力机制,具有IO感知能力。这种实现优化了计算速度,并考虑到输入/输出操作对性能的影响。 +2. **内存高效注意力**:这种实现旨在减少在执行缩放点积注意力时所需的内存占用,这对于处理大型模型或长序列尤为重要。 +3. **C++中定义的PyTorch实现**:这指的是在C++中编写的PyTorch函数实现,通常用于提高性能,因为C++编写的代码可以直接与底层硬件进行交互,从而优化计算效率。 +本教程需要PyTorch 2.0.0或更高版本。 - 对于 CUDA tensor输入,该函数将分派到以下实现之一 -: - - - -* [FlashAttention:具有 IO 感知的快速、内存高效的精确注意力](https://arxiv.org/abs/2205.14135) -* [内存高效的注意力](https://github.com/facebookresearch/xformers ) -* 用 C++ 定义的 PyTorch 实现 - - - - - 注意 - - - - - 本教程需要 PyTorch 2.0.0 或更高版本。 - - - - - - - -``` +```py import torch import torch.nn as nn import torch.nn.functional as F @@ -94,15 +33,9 @@ device = "cuda" if torch.cuda.is_available() else "cpu" # Example Usage: query, key, value = torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device), torch.randn(2, 3, 8, device=device) F.scaled_dot_product_attention(query, key, value) - ``` - - - - - -``` +```py tensor([[[-1.3321, -0.3489, 0.3015, -0.3912, 0.9867, 0.3137, -0.0691, -1.2593], [-1.0882, 0.2506, 0.6491, 0.1360, 0.5238, -0.2448, -0.0820, @@ -116,33 +49,13 @@ tensor([[[-1.3321, -0.3489, 0.3015, -0.3912, 0.9867, 0.3137, -0.0691, 0.2312], [ 0.4048, 0.0762, 0.3777, 0.4689, -0.2978, 0.2754, -0.6429, 0.1037]]], device='cuda:0') - ``` +## 显式调度器控制 [¶](#explicit-dispatcher-control "永久链接到此标题") +虽然该函数会隐式地分派到三种实现之一,但用户也可以通过使用上下文管理器(context manager)来显式控制分派。这个上下文管理器允许用户显式禁用某些实现。如果用户想确保函数确实针对他们的特定输入使用最快的实现,可以使用上下文管理器来遍历并测量性能。 - - -## 显式调度程序控制 [¶](#explicit-dispatcher-control "永久链接到此标题") - - - - - 虽然该函数将隐式分派到三个 -实现之一,但用户还可以通过使用上下文管理器 -显式控制分派。此上下文管理器允许用户 -显式禁用某些实现。如果用户想要确保 -该函数确实对其特定输入使用 -最快的实现, -可以使用上下文管理器来扫描 -测量性能。 - - - - - - -``` +```py # Lets define a helpful benchmarking function: import torch.utils.benchmark as benchmark def benchmark_torch_function_in_microseconds(f, *args, **kwargs): @@ -166,80 +79,47 @@ value = torch.rand(batch_size, num_heads, max_sequence_len, embed_dimension, dev print(f"The default implementation runs in 
{benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") # Lets explore the speed of each of the 3 implementations -from torch.backends.cuda import sdp_kernel, SDPBackend - -# Helpful arguments mapper -backend_map = { - SDPBackend.MATH: {"enable_math": True, "enable_flash": False, "enable_mem_efficient": False}, - SDPBackend.FLASH_ATTENTION: {"enable_math": False, "enable_flash": True, "enable_mem_efficient": False}, - SDPBackend.EFFICIENT_ATTENTION: { - "enable_math": False, "enable_flash": False, "enable_mem_efficient": True} -} +from torch.nn.attention import SDPBackend, sdpa_kernel -with sdp_kernel(**backend_map[SDPBackend.MATH]): - print(f"The math implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") +with sdpa_kernel(SDPBackend.MATH): + math_time=benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value) + print(f"The math implementation runs in {math_time:.3f} microseconds") -with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]): +with sdpa_kernel(SDPBackend.FLASH_ATTENTION): try: - print(f"The flash attention implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") + flash_time=benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value) + print(f"The flash attention implementation runs in {flash_time:.3f} microseconds") except RuntimeError: print("FlashAttention is not supported. See warnings for reasons.") -with sdp_kernel(**backend_map[SDPBackend.EFFICIENT_ATTENTION]): +with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION): try: - print(f"The memory efficient implementation runs in {benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value):.3f} microseconds") + efficient_time=benchmark_torch_function_in_microseconds(F.scaled_dot_product_attention, query, key, value) + print(f"The memory efficient implementation runs in {efficient_time:.3f} microseconds") except RuntimeError: print("EfficientAttention is not supported. 
See warnings for reasons.") - ``` - - - - +```py +The default implementation runs in 2304.977 microseconds +The math implementation runs in 19249.369 microseconds +The flash attention implementation runs in 2304.600 microseconds +The memory efficient implementation runs in 4197.082 microseconds ``` -The default implementation runs in 4741.745 microseconds -The math implementation runs in 19249.446 microseconds -The flash attention implementation runs in 4741.583 microseconds -The memory efficient implementation runs in 4193.383 microseconds - -``` - - - - ## 硬件依赖性 [¶](#hardware-dependence "永久链接到此标题") +根据您在上面代码单元运行的机器以及可用的硬件,您得到的结果可能会有所不同: +- 如果您没有GPU并且是在CPU上运行,那么上下文管理器将不起作用,三次运行应该返回相似的时间。 +- 根据您的显卡支持的计算能力,FlashAttention或内存高效注意力可能会失败。 +## 因果自注意力[¶](#causal-self-attention "永久链接到此标题") +下面是一个因果自注意力(multi-headed causal self attention)块的示例实现,灵感来源于Andrej Karpathy的NanoGPT仓库。 - 根据您运行上述单元的机器以及可用的硬件,您的结果可能会有所不同。 -- 如果您没有’ 没有 GPU 并且在 CPU 上运行,则上下文管理器\ n 将没有任何效果,并且所有三个运行都应返回相似的计时。 -- 取决于您的显卡支持的计算能力 -闪存关注或内存效率可能会失败。 - - - - - -## 因果自注意力 [¶](#causal-self-attention "永久链接到此标题") - - - - - 下面是一个多头因果自我注意力块的示例实现,灵感来自于 - [Andrej Karpathy NanoGPT](https://github.com/karpathy/nanoGPT) - 存储库。 - - - - - - -``` +```py class CausalSelfAttention(nn.Module): def __init__(self, num_heads: int, embed_dimension: int, bias: bool=False, is_causal: bool=False, dropout:float=0.0): @@ -290,49 +170,22 @@ embed_dimension = num_heads * heads_per_dim dtype = torch.float16 model = CausalSelfAttention(num_heads=num_heads, embed_dimension=embed_dimension, bias=False, is_causal=True, dropout=0.1).to("cuda").to(dtype).eval() print(model) - ``` - - - - -``` +```py CausalSelfAttention( (c_attn): Linear(in_features=512, out_features=1536, bias=False) (c_proj): Linear(in_features=512, out_features=512, bias=False) (resid_dropout): Dropout(p=0.1, inplace=False) ) - ``` +## NestedTensor 和 Dense 张量支持 +SDPA支持NestedTensor和Dense张量输入。NestedTensors处理的情况是输入是一个不等长序列的批次,而无需将每个序列填充到批次中的最大长度。有关NestedTensors的更多信息,请参阅torch.nested和NestedTensors教程。 - -### `NestedTensor` - 和密集tensor支持 [¶](#nestedtensor-and-dense-tensor-support "永久链接到此标题") - - - - SDPA 支持 - `NestedTensor` - 和密集tensor输入。 - `NestedTensor` - 处理输入是一批可变长度序列的情况 -无需将每个序列填充到最大长度批。有关 - `NestedTensors` 的更多信息,请参阅 - [torch.nested](https://pytorch.org/docs/stable/nested.html) - 和 - [NestedTensors 教程](https://pytorch.org/tutorials/prototype/nestedtensor.html) -. - - - - - - -``` +```py import random def generate_rand_batch( batch_size, @@ -362,8 +215,8 @@ def generate_rand_batch( seq_len_list[random.randint(0, batch_size - 1)] = max_sequence_len return ( torch.nested.nested_tensor( - - [torch.randn(seq_len, embed_dimension, + [ + torch.randn(seq_len, embed_dimension, dtype=dtype, device=device) for seq_len in seq_len_list ] @@ -377,106 +230,53 @@ random_dense, _ = generate_rand_batch(32, 512, embed_dimension, pad_percentage=N # Currently the fused implementations don't support ``NestedTensor`` for training model.eval() -with sdp_kernel(**backend_map[SDPBackend.FLASH_ATTENTION]): +with sdpa_kernel(SDPBackend.FLASH_ATTENTION): try: print(f"Random NT runs in {benchmark_torch_function_in_microseconds(model, random_nt):.3f} microseconds") print(f"Random Dense runs in {benchmark_torch_function_in_microseconds(model, random_dense):.3f} microseconds") except RuntimeError: print("FlashAttention is not supported. 
See warnings for reasons.") - ``` - - - - -``` -/var/lib/jenkins/workspace/intermediate_source/scaled_dot_product_attention_tutorial.py:226: UserWarning: +```py +/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py:166: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. (Triggered internally at ../aten/src/ATen/NestedTensorImpl.cpp:178.) -Random NT runs in 679.281 microseconds -Random Dense runs in 1183.933 microseconds - +Random NT runs in 558.517 microseconds +Random Dense runs in 936.630 microseconds ``` +## 使用 torch.compile 与 SDPA [¶](#using-sdpa-with-torch-compile "永久链接到此标题") +随着PyTorch 2.0的发布,引入了一个名为torch.compile()的新特性,它可以在急切模式(eager mode)上提供显著性能提升。缩放点积注意力(SDPA)与torch.compile()完全兼容。为了演示这一点,我们将使用torch.compile()编译CausalSelfAttention模块,并观察由此带来的性能提升。 - - - - - -# 使用 SDPA 与 - `torch.compile` [¶](#using-sdpa-with-torch-compile "永久链接到此标题") - - - - - 随着 PyTorch 2.0 的发布,引入了一项名为 - `torch.compile()` - 的新功能,与 eager 模式相比 -它可以提供 -显着的性能改进。 -缩放点积注意力完全可以与 -组合`torch.compile()` - 。 -为了演示这一点,让’s 使用 - `CausalSelfAttention` - 模块编译 - `torch.compile()` - 并观察由此产生的性能改进. - - - - - - -``` +```py batch_size = 32 max_sequence_len = 256 x = torch.rand(batch_size, max_sequence_len, embed_dimension, device=device, dtype=dtype) print( - f"The non compiled module runs in {benchmark_torch_function_in_microseconds(model, x):.3f} microseconds") + f"The non compiled module runs in {benchmark_torch_function_in_microseconds(model, x):.3f} microseconds") compiled_model = torch.compile(model) # Let's compile it compiled_model(x) print( - f"The compiled module runs in {benchmark_torch_function_in_microseconds(compiled_model, x):.3f} microseconds") - + f"The compiled module runs in {benchmark_torch_function_in_microseconds(compiled_model, x):.3f} microseconds") ``` - - - - +```py +The non compiled module runs in 408.207 microseconds +The compiled module runs in 516.612 microseconds ``` -The non compiled module runs in 416.696 microseconds -The compiled module runs in 453.513 microseconds - -``` - - - - - 确切的执行时间取决于机器,但是我的结果: -未编译的模块在 166.616 微秒内运行 -编译的模块在 166.726 微秒内运行 -这不是我们所期望的。让’s 更深入地挖掘一下。 -PyTorch 附带了一个令人惊叹的内置分析器,您可以使用它 -检查代码的性能特征。 - - +具体的执行时间取决于机器,但我的结果是:未编译的模块运行时间为166.616微秒,编译后的模块运行时间为166.726微秒。这并不是我们期望的结果。让我们深入探究一下。PyTorch内置了一个惊人的性能分析器(profiler),您可以使用它来检查代码的性能特征。 - - -``` +```py from torch.profiler import profile, record_function, ProfilerActivity activities = [ProfilerActivity.CPU] if device == 'cuda': @@ -496,98 +296,130 @@ with profile(activities=activities, record_shapes=False) as prof: print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10)) # For even more insights, you can export the trace and use ``chrome://tracing`` to view the results -# :: # -# prof.export_chrome_trace("compiled_causal_attention_trace.json"). - +# .. code-block:: python +# +# prof.export_chrome_trace("compiled_causal_attention_trace.json"). 
-
-```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls
------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
-                        Non-Compilied Causal Attention        16.91%       1.981ms        70.42%       8.250ms       8.250ms       0.000us         0.00%      11.013ms      11.013ms             1
-                                          aten::matmul         2.48%     291.000us        26.92%       3.154ms      63.080us       0.000us         0.00%       8.378ms     167.560us            50
-                                              aten::mm        18.89%       2.213ms        22.68%       2.657ms      53.140us       7.743ms        74.61%       8.378ms     167.560us            50
-                                          aten::linear         2.50%     293.000us        30.21%       3.539ms      70.780us       0.000us         0.00%       7.893ms     157.860us            50
-        ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn         0.00%       0.000us         0.00%       0.000us       0.000us       5.550ms        53.48%       5.550ms     222.000us            25
-                    aten::scaled_dot_product_attention         1.85%     217.000us        14.66%       1.718ms      68.720us       0.000us         0.00%       2.635ms     105.400us            25
-         aten::_scaled_dot_product_efficient_attention         3.61%     423.000us        12.81%       1.501ms      60.040us       0.000us         0.00%       2.635ms     105.400us            25
-                    aten::_efficient_attention_forward         3.36%     394.000us         8.33%     976.000us      39.040us       2.635ms        25.39%       2.635ms     105.400us            25
-fmha_cutlassF_f16_aligned_64x64_rf_sm80(PyTorchMemEf...         0.00%       0.000us         0.00%       0.000us       0.000us       2.635ms        25.39%       2.635ms     105.400us            25
-ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_3...         0.00%       0.000us         0.00%       0.000us       0.000us       2.193ms        21.13%       2.193ms      87.720us            25
+                        Non-Compilied Causal Attention        20.01%       2.285ms        77.24%       8.821ms       8.821ms       0.000us         0.00%      11.098ms      11.098ms             1
+                        Non-Compilied Causal Attention         0.00%       0.000us         0.00%       0.000us       0.000us      10.328ms        50.41%      10.328ms      10.328ms             1
+                                          aten::matmul         2.36%     269.000us        27.28%       3.115ms      62.300us       0.000us         0.00%       8.156ms     163.120us            50
+                                              aten::mm        18.72%       2.138ms        22.97%       2.623ms      52.460us       7.750ms        37.83%       8.156ms     163.120us            50
+                                          aten::linear         1.62%     185.000us        30.99%       3.539ms      70.780us       0.000us         0.00%       8.068ms     161.360us            50
+        ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_tn         0.00%       0.000us         0.00%       0.000us       0.000us       5.552ms        27.10%       5.552ms     222.080us            25
+                    aten::scaled_dot_product_attention         1.97%     225.000us        17.75%       2.027ms      81.080us       0.000us         0.00%       2.942ms     117.680us            25
+             aten::_scaled_dot_product_flash_attention         3.38%     386.000us        15.78%       1.802ms      72.080us       0.000us         0.00%       2.942ms     117.680us            25
+                        aten::_flash_attention_forward         4.45%     508.000us        11.48%       1.311ms      52.440us       2.411ms        11.77%       2.942ms     117.680us            25
+void pytorch_flash::flash_fwd_kernel
+
+tensor([[ True, False, False, False, False, False, False, False, False, False],
+        [ True,  True, False, False, False, False, False, False, False, False]])
+tensor([[ True,  True,  True,  True,  True,  True,  True,  True,  True, False],
+        [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True]])
+```
+
+## Conclusion
+In this tutorial we demonstrated the basic usage of torch.nn.functional.scaled_dot_product_attention. We showed how the sdpa_kernel context manager can be used to assert that a particular implementation is used on GPU. In addition, we built a simple CausalSelfAttention module that works with NestedTensor and is torch-compilable. Along the way we also showed how the profiling tools can be used to explore the performance characteristics of a user-defined module.
-**Total running time of the script:** (0 minutes 8.239 seconds)
+Total running time of the script: (0 minutes 7.894 seconds)
\ No newline at end of file
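Editor's note: the hunk above keeps the two boolean masks printed by the "Using SDPA with attn_bias subclasses" section, but the code that produced them did not survive the merge. As a hedged stand-in (not the original tutorial code, which works with attention-bias subclasses), equivalent masks can be built by hand and passed to SDPA:

```py
# Assumed reconstruction aid: the printed masks contrast a top-left aligned causal mask
# with a bottom-right aligned one for seq_len_q=2 and seq_len_kv=10. Either can be fed
# to SDPA through attn_mask. All shapes here are illustrative.
import torch
import torch.nn.functional as F

seq_len_q, seq_len_kv = 2, 10

# Top-left aligned: query position i may attend to key positions 0..i.
upper_left = torch.ones(seq_len_q, seq_len_kv).tril(diagonal=0).bool()
# Bottom-right aligned: the last query row may attend to every key position.
lower_right = torch.ones(seq_len_q, seq_len_kv).tril(diagonal=seq_len_kv - seq_len_q).bool()
print(upper_left)
print(lower_right)

query = torch.randn(1, 1, seq_len_q, 8)
key = torch.randn(1, 1, seq_len_kv, 8)
value = torch.randn(1, 1, seq_len_kv, 8)
# A boolean attn_mask marks the positions that are allowed to attend.
out = F.scaled_dot_product_attention(query, key, value, attn_mask=lower_right)
print(out.shape)  # torch.Size([1, 1, 2, 8])
```

With seq_len_q smaller than seq_len_kv, the two alignments differ only in which diagonal the causal cut is taken on, which is exactly the difference visible in the two printed masks.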
diff --git a/mkdocs.yml b/mkdocs.yml
index a6d69471e..ee3a07e89 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -182,9 +182,9 @@ nav:
       - "Introduction to torch.compile": "2.0/tutorials/intermediate/torch_compile_tutorial.md"
       - "Inductor CPU backend debugging and profiling": "2.0/tutorials/intermediate/inductor_debug_cpu.md"
       - "(Beta) Implementing High-Performance Transformers with Scaled Dot Product Attention (SDPA)": "2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial.md"
-      - "Using SDPA with torch.compile": "2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#using-sdpa-with-torch-compile.md"
-      - "Using SDPA with attn_bias subclasses`": "2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#using-sdpa-with-attn-bias-subclasses.md"
-      - "Conclusion": "2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial#conclusion.md"
+      - "Using SDPA with torch.compile": "2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial.md"
+      - "Using SDPA with attn_bias subclasses": "2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial.md"
+      - "Conclusion": "2.0/tutorials/intermediate/scaled_dot_product_attention_tutorial.md"
       - "Knowledge Distillation Tutorial": "2.0/tutorials/beginner/knowledge_distillation_tutorial.md"
       - "Parallel and Distributed Training":
         - "Distributed and Parallel Training Tutorials": "2.0/tutorials/distributed/home.md"
diff --git a/themes_material b/themes_material
index c75743a3b..cc93b8903 160000
--- a/themes_material
+++ b/themes_material
@@ -1 +1 @@
-Subproject commit c75743a3b7a0613d5e3ba3419668c1d6b3007dd3
+Subproject commit cc93b89037fc25e277379fd0bed7e59d367c150e