🚀 1. The feature
This RFC proposes using the Triton language to implement Intel GPU (torch.xpu) kernels.
1.1. Motivation
This RFC aims to discuss the XPU kernel implementation with the Triton language. The Triton kernel will contain most of the device-agnostic logic, and the device-specific logic will be implemented separately.
The only additional package requirement is the triton package. However, since triton is already a default requirement of torch, we do not need to add it as a separate requirement in torchvision.
2. Op List
nms
roi pool
roi align
ps roi pool
ps roi align
deform_conv2d
3. File Structure
```
torchvision
├── csrc
│   └── ops            # The current op registration in C++; no changes here
│       └── nms.cpp
└── ops                # This folder is pure Python
    ├── triton         # The shared Triton kernel implementation; it does not register the kernel
    │   └── nms_kernel.py
    └── xpu            # The device-specific kernel implementation and registration
        └── nms_kernel.py
```
4. Details
4.1. Register Custom Kernel Implementation
The kernel implementation consists of two parts. First, we aim to use Triton to implement the device-agnostic logic and cover as much of the kernel as possible. Second, we will use ATen operations for the logic that Triton does not support and for device-specific logic.
4.1.1. Triton Kernel Implementation
To maximize generality, we propose implementing the main logic in the Triton language. This part is device-agnostic: as long as the device/platform supports Triton, the Triton kernel should work.
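The device-agnostic kernel itself is not spelled out in this RFC. As a rough, non-authoritative sketch (the mask layout, box format, and IoU computation below are assumptions made for illustration, not part of the proposal), the shared triton_nms_kernel referenced in the next section could look roughly like this:

```python
# torchvision/ops/triton/nms_kernel.py (illustrative sketch only)
import triton
import triton.language as tl


@triton.jit
def triton_nms_kernel(boxes_ptr, scores_ptr, mask_ptr, threshold,
                      num_boxes, BLOCK_SIZE: tl.constexpr):
    # Boxes are assumed to be pre-sorted by score (descending) and laid out
    # as (x1, y1, x2, y2). scores_ptr is unused here because the sorting is
    # done on the host with ATen (see 4.1.2).
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    valid = offs < num_boxes

    x1 = tl.load(boxes_ptr + offs * 4 + 0, mask=valid, other=0.0)
    y1 = tl.load(boxes_ptr + offs * 4 + 1, mask=valid, other=0.0)
    x2 = tl.load(boxes_ptr + offs * 4 + 2, mask=valid, other=0.0)
    y2 = tl.load(boxes_ptr + offs * 4 + 3, mask=valid, other=0.0)
    area = (x2 - x1) * (y2 - y1)

    # Compare each box in this block against every higher-scoring box j and
    # record whether it would be suppressed (IoU above the threshold).
    for j in range(0, num_boxes):
        jx1 = tl.load(boxes_ptr + j * 4 + 0)
        jy1 = tl.load(boxes_ptr + j * 4 + 1)
        jx2 = tl.load(boxes_ptr + j * 4 + 2)
        jy2 = tl.load(boxes_ptr + j * 4 + 3)
        jarea = (jx2 - jx1) * (jy2 - jy1)

        inter_w = tl.maximum(tl.minimum(x2, jx2) - tl.maximum(x1, jx1), 0.0)
        inter_h = tl.maximum(tl.minimum(y2, jy2) - tl.maximum(y1, jy1), 0.0)
        inter = inter_w * inter_h
        iou = inter / (area + jarea - inter)

        # mask[i, j] == 1 means box i is suppressed by the higher-scoring box j.
        suppress = (iou > threshold) & (j < offs)
        tl.store(mask_ptr + offs * num_boxes + j,
                 suppress.to(tl.int8), mask=valid)
```

The device-specific wrapper in 4.1.2 would then reduce this pairwise suppression mask into the final set of kept indices as part of its post-processing logic.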
4.1.2. XPU Kernel Implementation and Registration
This part implements the device-specific logic. There are two aspects:
Ops that Triton does not support, for example tensor initialization, argsort, etc.
Device-specific Triton configuration, which affects the performance of the kernel (e.g., the Triton config passed to the kernel).
The developer will have to wrap the above logic into a custom kernel and register it with PyTorch using torch.library.register_kernel(kernel_name, device).
```python
# torchvision/ops/xpu/nms_kernel.py

@torch.library.register_kernel("torchvision::nms", "xpu")
def xpu_custom_nms_triton_kernel(boxes: Tensor, scores: Tensor, threshold: float) -> Tensor:
    # 1: Perform operations that Triton doesn't support:
    #    - Tensor initialization, which may be device-specific
    output = torch.ones(num_boxes, dtype=torch.int32, device=boxes.device)
    #    - Use ATen calls for Triton-unsupported ops (argsort in this case)
    order = torch.argsort(scores, descending=True)
    boxes = boxes[order]
    ...

    # 2: Call the Triton kernel.
    #    Hardware vendors can decide which configuration is optimal for their
    #    device, for example BLOCK_SIZE in this case.
    grid = lambda meta: (triton.cdiv(num_boxes, meta["BLOCK_SIZE"]),)
    triton_nms_kernel[grid](boxes, scores, mask, threshold, num_boxes,
                            BLOCK_SIZE=BLOCK_SIZE)  # BLOCK_SIZE: device-tuned value

    # 3: Post-processing logic
    output = torch.narrow(...)
    return output
```
In this way, the user can call torchvision.ops.nms in both eager mode and with torch.compile.
5. Run The Kernel
5.1. Eager Mode
The user can call the op directly in eager mode. In this scenario, the nms call dispatches to the custom implementation.
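For illustration only (the tensors below are made up, and an XPU-enabled build is assumed), the eager call is just a normal nms call:

```python
import torch
import torchvision

# Hypothetical inputs: boxes in (x1, y1, x2, y2) format on an XPU device.
boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],
                      [50., 50., 60., 60.]], device="xpu")
scores = torch.tensor([0.9, 0.8, 0.7], device="xpu")

# Dispatches to the XPU custom kernel registered above.
keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
```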
5.2. torch.compile Mode
The user can call torch.compile as usual. torch.compile uses the Inductor compiler to decompose ops into primitive ops. When an op has no registered decomposition, Inductor falls back to the ATen call; in this scenario, Inductor will not decompose our custom kernel, so it behaves the same as in eager mode.
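A minimal, hypothetical sketch of the compiled path (the postprocess function is invented for illustration):

```python
import torch
import torchvision

def postprocess(boxes, scores):
    # torchvision::nms has no decomposition, so Inductor keeps it as a call
    # into the registered XPU kernel.
    keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
    return boxes[keep]

compiled = torch.compile(postprocess)
out = compiled(boxes, scores)  # boxes/scores as in the eager example above
```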
6. Alternatives
Native C++ Implementation:
Currently, all the kernel backends are in torchvision/csrc/ops and are written in C++. This approach requires every device backend developer to implement the kernels using their own compiler/language (for example, CUDA with nvcc). The pros of native C++ are that it can reach peak performance and it has Windows support.
By comparison, Triton has some limitations. For example, Triton does not support manually placing a tensor in shared memory, so the shared-memory logic of the CUDA nms kernel cannot be reproduced in Triton. However, Triton is more general while offering reasonable performance, and the optimization logic of a Triton kernel can be shared across different devices.
7. Limitations
Windows Support: Currently, Triton does not support Windows, so kernels implemented with Triton will fail on Windows. However, once Triton supports the Windows platform, no additional effort will be needed on our side.
Performance Alignment: Since Triton has some limitations, performance may not fully match the native C++ implementation.
8. Additional context
Is it possible for both a Triton kernel and a C++ implementation to exist in torch.compile mode?
In this case, one may register the C++ implementation as a custom Inductor lowering.
cc: @EikanWang @fengyuan14 @riverliuintel