
Conversation

@titaiwangms
Contributor

This pull request updates the attention kernel selection logic and clarifies support for unidirectional (causal) attention in the CUDA attention implementation. The main changes focus on improving documentation, removing outdated comments, and explicitly setting the kernel type for better maintainability and clarity.

Kernel selection and configuration improvements:

  • Explicitly set the kernel_type field to AttentionKernel_Unfused in the AttentionData structure to make clear which kernel is used and to improve future extensibility (see the sketch below).
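
A minimal C++ sketch of the idea, using simplified stand-in types rather than the actual onnxruntime definitions (the real AttentionData and AttentionKernelType carry many more fields and enum values):

```cpp
#include <cstdio>

// Simplified stand-ins for the onnxruntime types; the real enum also
// includes fused TRT, flash, and memory-efficient kernel variants.
enum AttentionKernelType {
  AttentionKernel_Default,
  AttentionKernel_Unfused,
};

struct AttentionData {
  // ... workspace pointers, mask data, etc. elided ...
  AttentionKernelType kernel_type = AttentionKernel_Default;
};

int main() {
  AttentionData data;
  // The PR sets this explicitly rather than relying on a default,
  // so the chosen kernel path is visible at the call site.
  data.kernel_type = AttentionKernel_Unfused;
  std::printf("kernel_type=%d\n", static_cast<int>(data.kernel_type));
  return 0;
}
```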

Documentation and code clarity:

  • Added comments clarifying that unidirectional (causal) attention is supported by several attention kernel implementations, and that the TRT fused runner is used only for non-unidirectional cases, as enforced elsewhere in the selection logic (a sketch follows this list).
  • Removed outdated TODO comments regarding parameter continuation and kernel selection, as these are now handled more explicitly in the code. [1] [2]
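
A hedged sketch of the invariant described above; the function and parameter names here are illustrative, not the actual onnxruntime symbols:

```cpp
#include <cassert>

// Illustrative guard (names assumed): the TRT fused runner is chosen only
// when attention is not unidirectional; causal requests fall back to
// kernels that apply the causal mask themselves.
bool ShouldUseTrtFusedRunner(bool has_fused_runner, bool is_unidirectional) {
  return has_fused_runner && !is_unidirectional;
}

int main() {
  assert(ShouldUseTrtFusedRunner(/*has_fused_runner=*/true, /*is_unidirectional=*/false));
  assert(!ShouldUseTrtFusedRunner(/*has_fused_runner=*/true, /*is_unidirectional=*/true));
  return 0;
}
```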


Copilot AI left a comment


Pull request overview

This PR improves the attention kernel selection logic for the CUDA Attention operator by explicitly setting kernel types and removing outdated code. The changes enhance code clarity and maintainability while enabling proper support for causal (unidirectional) attention with the unfused kernel path.

Changes:

  • Explicitly set kernel_type to AttentionKernel_Unfused in the Attention operator to clarify kernel selection
  • Remove outdated TODO comments about parameter handling and kernel selection that are now properly implemented
  • Relax the assertion in PrepareQkv_MHA_NoPast that incorrectly prevented causal attention, replacing it with a clarifying comment about which kernels support unidirectional attention (see the sketch after this list)
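
A simplified before/after sketch of that relaxed check; the helper name comes from the PR, but the parameter type and surrounding details are assumptions, not the real attention_prepare_qkv.cu code:

```cpp
struct AttentionParameters {
  bool is_unidirectional;  // true for causal attention
};

// Sketch of the helper after the change (simplified).
void PrepareQkv_MHA_NoPast(const AttentionParameters& params) {
  // Previously something like:
  //   assert(!params.is_unidirectional);
  // which rejected causal attention even though the unfused, flash, and
  // memory-efficient kernels all honor is_unidirectional. The check is
  // dropped; only the TRT fused runner needs non-causal inputs, and that
  // is enforced at kernel-selection time.
  (void)params;
  // ... pack Q, K, V into the layout the selected kernel expects ...
}
```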

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

  • onnxruntime/core/providers/cuda/llm/attention.cc: Removes obsolete TODOs and explicitly sets kernel_type to AttentionKernel_Unfused, making kernel selection more explicit and maintainable.
  • onnxruntime/contrib_ops/cuda/bert/attention_prepare_qkv.cu: Removes the overly restrictive assertion blocking causal attention and adds an explanatory comment about kernel support for the is_unidirectional flag.


