🚀 1. The feature
This RFC proposes using the Triton language to implement Intel GPU (torch.xpu) kernels.
1.1. Motivation
This RFC aims to discuss the XPU kernel implementation with the Triton language. The Triton kernel will contain most of the device-agnostic logic, and the device-specific logic will be implemented separately.
The only additional package requirement is the triton package. However, since triton is already a default requirement of torch, we do not need to add it as a separate requirement in torchvision.
2. Op List
nms
roi pool
roi align
ps roi pool
ps roi align
deform_conv2d
3. File Structure
```
torchvision
├── csrc
│   └── ops            # The current op registration in C++; no changes here
│       └── nms.cpp
└── ops                # This folder is pure Python
    ├── triton         # The shared Triton kernel implementation; it does not register the kernel
    │   └── nms_kernel.py
    └── xpu            # The device-specific kernel implementation and registration
        └── nms_kernel.py
```
4. Details
4.1. Register Custom Kernel Implementation
The kernel implementation consists of two parts. First, we aim to use Triton to implement the device-agnostic logic and cover as much of the kernel as possible. Second, we will use ATen operations for the logic that Triton does not support and for device-specific logic.
4.1.1. Triton Kernel Implementation
To maximize generality, we propose implementing the main logic in the Triton language. This part is device-agnostic: as long as the device/platform supports Triton, the Triton kernel should work.
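The device-agnostic kernel itself is not spelled out in this RFC. As a rough, non-authoritative sketch (the mask layout, box format, and IoU computation below are assumptions made for illustration, not part of the proposal), the shared triton_nms_kernel referenced in the next section could look roughly like this:

```python
# torchvision/ops/triton/nms_kernel.py (illustrative sketch only)
import triton
import triton.language as tl


@triton.jit
def triton_nms_kernel(boxes_ptr, scores_ptr, mask_ptr, threshold,
                      num_boxes, BLOCK_SIZE: tl.constexpr):
    # Boxes are assumed to be pre-sorted by score (descending) and laid out
    # as (x1, y1, x2, y2). scores_ptr is unused here because the sorting is
    # done on the host with ATen (see 4.1.2).
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    valid = offs < num_boxes

    x1 = tl.load(boxes_ptr + offs * 4 + 0, mask=valid, other=0.0)
    y1 = tl.load(boxes_ptr + offs * 4 + 1, mask=valid, other=0.0)
    x2 = tl.load(boxes_ptr + offs * 4 + 2, mask=valid, other=0.0)
    y2 = tl.load(boxes_ptr + offs * 4 + 3, mask=valid, other=0.0)
    area = (x2 - x1) * (y2 - y1)

    # Compare each box in this block against every higher-scoring box j and
    # record whether it would be suppressed (IoU above the threshold).
    for j in range(0, num_boxes):
        jx1 = tl.load(boxes_ptr + j * 4 + 0)
        jy1 = tl.load(boxes_ptr + j * 4 + 1)
        jx2 = tl.load(boxes_ptr + j * 4 + 2)
        jy2 = tl.load(boxes_ptr + j * 4 + 3)
        jarea = (jx2 - jx1) * (jy2 - jy1)

        inter_w = tl.maximum(tl.minimum(x2, jx2) - tl.maximum(x1, jx1), 0.0)
        inter_h = tl.maximum(tl.minimum(y2, jy2) - tl.maximum(y1, jy1), 0.0)
        inter = inter_w * inter_h
        iou = inter / (area + jarea - inter)

        # mask[i, j] == 1 means box i is suppressed by the higher-scoring box j.
        suppress = (iou > threshold) & (j < offs)
        tl.store(mask_ptr + offs * num_boxes + j,
                 suppress.to(tl.int8), mask=valid)
```

The device-specific wrapper in 4.1.2 would then reduce this pairwise suppression mask into the final set of kept indices as part of its post-processing logic.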
4.1.2. XPU Kernel Implementation and Registration
This part implements the device-specific logic. There are two aspects:
Ops that Triton does not support, for example tensor initialization, argsort, etc.
Device-specific Triton configuration, which affects the performance of the kernel (e.g., the Triton config passed to the kernel).
The developer will have to wrap the above logic into a custom kernel and register it with PyTorch using torch.library.register_kernel(kernel_name, device).
```python
# torchvision/ops/xpu/nms_kernel.py

@torch.library.register_kernel("torchvision::nms", "xpu")
def xpu_custom_nms_triton_kernel(boxes: Tensor, scores: Tensor, threshold: float) -> Tensor:
    # 1: Perform operations that Triton doesn't support:
    #    - Tensor initialization, which may be device-specific
    output = torch.ones(num_boxes, dtype=torch.int32, device=boxes.device)
    #    - Use ATen calls for Triton-unsupported ops (argsort in this case)
    order = torch.argsort(scores, descending=True)
    boxes = boxes[order]
    ...

    # 2: Call the Triton kernel.
    #    Hardware vendors can decide which configuration is optimal for their
    #    device, for example BLOCK_SIZE in this case.
    grid = lambda meta: (triton.cdiv(num_boxes, meta["BLOCK_SIZE"]),)
    triton_nms_kernel[grid](boxes, scores, mask, threshold, num_boxes,
                            BLOCK_SIZE=BLOCK_SIZE)  # BLOCK_SIZE: device-tuned value

    # 3: Post-processing logic
    output = torch.narrow(...)
    return output
```
In this way, the user can call torchvision.ops.nms in both eager mode and with torch.compile.
5. Run The Kernel
5.1. Eager Mode
The user can call the op directly in eager mode. In this scenario, the nms call dispatches to the custom implementation.
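For illustration only (the tensors below are made up, and an XPU-enabled build is assumed), the eager call is just a normal nms call:

```python
import torch
import torchvision

# Hypothetical inputs: boxes in (x1, y1, x2, y2) format on an XPU device.
boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],
                      [50., 50., 60., 60.]], device="xpu")
scores = torch.tensor([0.9, 0.8, 0.7], device="xpu")

# Dispatches to the XPU custom kernel registered above.
keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
```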
5.2. torch.compile Mode
The user can call torch.compile as usual. torch.compile uses the Inductor compiler to decompose ops into primitive ops. When an op has no registered decomposition, Inductor falls back to the ATen call; in this scenario, Inductor will not decompose our custom kernel, so it behaves the same as in eager mode.
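A minimal, hypothetical sketch of the compiled path (the postprocess function is invented for illustration):

```python
import torch
import torchvision

def postprocess(boxes, scores):
    # torchvision::nms has no decomposition, so Inductor keeps it as a call
    # into the registered XPU kernel.
    keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
    return boxes[keep]

compiled = torch.compile(postprocess)
out = compiled(boxes, scores)  # boxes/scores as in the eager example above
```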
6. Alternatives
Native C++ Implementation:
Currently, all the kernel backends are in torchvision/csrc/ops and are written in C++. This approach requires every device backend developer to implement the kernels using their own compiler/language (for example, CUDA with nvcc). The pros of native C++ are that it can reach peak performance and it has Windows support.
By comparison, Triton has some limitations. For example, Triton does not support manually placing a tensor in shared memory, so the shared-memory logic of the CUDA nms kernel cannot be reproduced in Triton. However, Triton is more general while offering reasonable performance, and the optimization logic of a Triton kernel can be shared across different devices.
7. Limitations
Windows Support: Currently, Triton does not support Windows, so kernels implemented with Triton will fail on Windows. However, once Triton supports the Windows platform, no additional effort will be needed on our side.
Performance Alignment: Since Triton has some limitations, performance may not fully match the native C++ implementation.
8. Additional context
Is it possible for both a Triton kernel and a C++ implementation to exist in torch.compile mode?
In this case, one may register the C++ implementation as a custom Inductor lowering.
cc: @EikanWang @fengyuan14 @riverliuintel