
Conversation

Contributor

@titaiwangms commented on Oct 31, 2025

Fixes #24554
This pull request adds support for 4D QKV input tensors in the Attention operator, specifically in the unfused CUDA kernel for Attention-23, and refines the handling of sequence lengths and causal masking for correctness and flexibility. The changes touch the operator documentation and several CUDA implementation files, enabling the new input format and ensuring correct output shapes and masking across scenarios.

Support for 4D QKV input and improved QKV format handling:

  • Added support for the Q_K_V_BNSH (4D) QKV input format in the CUDA Attention kernel, including updates to the AttentionParameters and AttentionData structs to track whether the input is 4D. When the input is 4D, the kernel now writes the output directly, avoiding unnecessary transposes.

  • The QKV preparation logic (PrepareQkv_MHA_NoPast, PrepareQkv_MHA_WithPast_NoBias, and PrepareQkv_MultiHeadAttention) now supports both 3D (BSNH) and 4D (BNSH) input formats, with assertions and error handling for unsupported combinations (e.g., bias with 4D input); see the sketch after this list.
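
The following is a minimal sketch of the input-format bookkeeping described above, not the actual ORT kernel code: the format names and the bias restriction come from the PR description, while ClassifyQkvInput and AttentionShapeInfo are hypothetical helpers used only for illustration.

#include <cstddef>
#include <stdexcept>

enum class QkvFormat { Q_K_V_BSNH, Q_K_V_BNSH };

struct AttentionShapeInfo {
  QkvFormat format;
  bool is_4d_input;  // true when Q/K/V arrive as (batch, num_heads, sequence, head_size)
};

inline AttentionShapeInfo ClassifyQkvInput(std::size_t query_rank, bool has_bias) {
  AttentionShapeInfo info{};
  info.is_4d_input = (query_rank == 4);
  info.format = info.is_4d_input ? QkvFormat::Q_K_V_BNSH : QkvFormat::Q_K_V_BSNH;
  if (info.is_4d_input && has_bias) {
    // Mirrors the PR's restriction: bias combined with 4D (BNSH) input is rejected.
    throw std::invalid_argument("bias is not supported with 4D (BNSH) QKV input");
  }
  return info;
}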

Causal masking and sequence length correctness:

  • Refactored the causal masking logic in the softmax CUDA kernels to use the explicit past_sequence_length parameter, ensuring correct attention windowing for both incremental and non-incremental decoding; see the sketch after this list. This affects both the small and large softmax kernels and their raw-mask variants.

  • Updated kernel launch sites to propagate the new past_sequence_length argument, so all relevant CUDA calls use the correct sequence length for masking.
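
As a rough sketch of the windowing rule above (illustrative only; the real logic lives in the softmax CUDA kernels touched by this PR, and the helper below is hypothetical): a query at position i within the current sequence_length tokens sits at absolute position past_sequence_length + i, so under causal masking it may attend only to keys at absolute positions up to that value.

// Hypothetical device helper, not the actual softmax kernel code.
__device__ __forceinline__ bool IsCausallyVisible(int query_pos,            // 0 .. sequence_length - 1
                                                  int key_pos,              // 0 .. total_sequence_length - 1
                                                  int past_sequence_length) {
  // Query i in the current chunk lives at absolute position past_sequence_length + i,
  // so it may attend to keys at absolute positions <= past_sequence_length + i.
  return key_pos <= past_sequence_length + query_pos;
}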

Sequence length variable naming and usage:

  • Renamed and corrected the use of sequence_length and kv_sequence_length in the QKV-to-context computations, ensuring the correct sequence length is used for key/value tensors (see the shape sketch below).
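
As a shape reminder for the renaming above, here is a deliberately naive CPU-side sketch (not ORT code; NaiveScores is a made-up helper): the score matrix is sequence_length x kv_len, so the key/value length, not the query length, must size its inner dimension.

#include <cassert>
#include <cstddef>
#include <vector>

// Naive QK^T for a single head; q is sequence_length x head_size, k is kv_len x head_size.
std::vector<float> NaiveScores(const std::vector<float>& q, const std::vector<float>& k,
                               int sequence_length, int kv_len, int head_size) {
  assert(q.size() == static_cast<std::size_t>(sequence_length) * head_size);
  assert(k.size() == static_cast<std::size_t>(kv_len) * head_size);
  std::vector<float> scores(static_cast<std::size_t>(sequence_length) * kv_len, 0.0f);
  for (int i = 0; i < sequence_length; ++i)
    for (int j = 0; j < kv_len; ++j)
      for (int h = 0; h < head_size; ++h)
        scores[i * kv_len + j] += q[i * head_size + h] * k[j * head_size + h];
  return scores;  // using sequence_length in place of kv_len here breaks every case where they differ
}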

Documentation update:

  • Updated the operator documentation (OperatorKernels.md) to describe the new input/output signatures and supported types for the Attention operator version 23+, reflecting support for bfloat16 and the new 4D input format.

NOT supported in this PR

  • Boolean mask
  • GQA
  • Softcap
  • Softmax precision
  • qk_output_mode other than -1 and 0

@titaiwangms added the ep:CUDA label (issues related to the CUDA execution provider) on Nov 19, 2025
@titaiwangms marked this pull request as ready for review on December 16, 2025 23:47
      stream, total_sequence_length, sequence_length, batch_size, num_heads,
      mask_index, mask_start, data.attention_bias, broadcast_attn_bias_dim_0, broadcast_attn_bias_dim_1,
-     data.scratch, scratch2, parameters.is_unidirectional));
+     data.scratch, scratch2, parameters.is_unidirectional, parameters.past_sequence_length));
Contributor

No need to pass this, since it can be deduced from the 2nd and 3rd parameters:
past_sequence_length = total_sequence_length - sequence_length

Contributor Author

I am using parameters.kv_sequence_length here. It turns out that when it's cross attention with causal masking, the offset needs to be calculated as "total_sequence_length - kv_sequence_length". Let me know what you think about the changes.
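
A toy illustration of the distinction (the numbers below are invented for this example, not taken from the PR):

#include <cstdio>

int main() {
  // Self-attention decoding with a KV cache: 9 cached tokens, 1 new query token,
  // 1 new key/value token. Both formulas give the same offset of 9.
  int total_sequence_length = 10, sequence_length = 1, kv_sequence_length = 1;
  std::printf("%d %d\n", total_sequence_length - sequence_length,
              total_sequence_length - kv_sequence_length);  // prints: 9 9

  // Cross-attention-like case: the key/value length is decoupled from the query
  // length, so the formulas diverge and only the kv-based one gives the offset of 0.
  sequence_length = 2;
  kv_sequence_length = 10;
  total_sequence_length = 10;
  std::printf("%d %d\n", total_sequence_length - sequence_length,
              total_sequence_length - kv_sequence_length);  // prints: 8 0
  return 0;
}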

ORT_RETURN_IF_ERROR(LaunchTransCtx(stream, sequence_length, batch_size, v_head_size, num_heads,
device_prop.maxThreadsPerBlock, false, temp_output, data.output));
}
DUMP_TENSOR_D("Attention Output", data.output, batch_size, sequence_length, num_heads, v_head_size);
Contributor

This should be moved inside if (!parameters.output_is_Q_K_V_BNSH).

Contributor Author

I think "Attention Output" debugging can be helpful to both cases. data.output by this point should have the result in both cases. Could you elaborate more on the concern?

float scale;
bool use_tf32;
bool is_4d_input = false;
bool output_is_Q_K_V_BNSH; // whether the output format is Q_K_V_BNSH
Contributor

how about is_output_bnsh?

Contributor Author

Done
