Reland reason
Reland #26466
The previous PR was reverted because it failed on the following test:
This PR includes the correct fix.
Description
This pull request introduces significant improvements and expanded support for multi-head attention kernels in ONNX Runtime, focusing in particular on supporting both 3D (BSNH) and 4D (BNSH) QKV input formats. The changes improve the flexibility, correctness, and maintainability of attention operations across the CPU and CUDA implementations. A sketch of the two layouts follows.
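As a minimal illustration (not ONNX Runtime's actual code), the following shows how the same Q/K/V element is addressed in each layout: in 3D BSNH the heads are packed into the hidden dimension, shape `[batch, sequence, num_heads * head_size]`, while in 4D BNSH the head dimension is explicit, shape `[batch, num_heads, sequence, head_size]`. A kernel that accepts BNSH directly can skip the transpose shown at the end.

```cpp
#include <cstddef>

// Flat offset of element (b, n, s, h) in a 3D BSNH tensor
// of shape [batch, seq_len, num_heads * head_size].
size_t IndexBSNH(size_t b, size_t n, size_t s, size_t h,
                 size_t seq_len, size_t num_heads, size_t head_size) {
  return ((b * seq_len + s) * num_heads + n) * head_size + h;
}

// Flat offset of the same element in a 4D BNSH tensor
// of shape [batch, num_heads, seq_len, head_size].
size_t IndexBNSH(size_t b, size_t n, size_t s, size_t h,
                 size_t seq_len, size_t num_heads, size_t head_size) {
  return ((b * num_heads + n) * seq_len + s) * head_size + h;
}

// Converting BSNH to BNSH swaps the S and N axes; a kernel that consumes
// BNSH inputs directly avoids this copy entirely.
void TransposeBSNHToBNSH(const float* src, float* dst,
                         size_t batch, size_t seq_len,
                         size_t num_heads, size_t head_size) {
  for (size_t b = 0; b < batch; ++b)
    for (size_t s = 0; s < seq_len; ++s)
      for (size_t n = 0; n < num_heads; ++n)
        for (size_t h = 0; h < head_size; ++h)
          dst[IndexBNSH(b, n, s, h, seq_len, num_heads, head_size)] =
              src[IndexBSNH(b, n, s, h, seq_len, num_heads, head_size)];
}
```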
Expanded QKV Input Format Support

Added support for the 4D QKV input format (`Q_K_V_BNSH`) in CUDA attention kernels, including proper handling of the cases with and without past/present states, and enforcing that bias is not supported for this format. This includes logic that avoids unnecessary transposes and writes outputs directly when possible; a pared-down sketch of these checks follows.
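As a rough sketch of the constraints just described (the real `AttentionParameters` struct and `AttentionQkvFormat` enum in ONNX Runtime contain many more fields and values than shown here, and the two helper functions are hypothetical):

```cpp
// Pared-down, illustrative versions of ONNX Runtime's attention types;
// only the members needed for this sketch are shown.
enum class AttentionQkvFormat { Q_K_V_BSNH, Q_K_V_BNSH };

struct AttentionParameters {
  AttentionQkvFormat qkv_format;
  bool has_bias;
  bool has_past_state;
};

// Hypothetical helper: 4D BNSH inputs already match the core kernel's
// expected layout, so no input transpose is needed for them.
bool CanSkipInputTranspose(const AttentionParameters& p) {
  return p.qkv_format == AttentionQkvFormat::Q_K_V_BNSH;
}

// Hypothetical helper: bias is rejected for the 4D format.
bool IsValidQkvConfig(const AttentionParameters& p) {
  if (p.qkv_format == AttentionQkvFormat::Q_K_V_BNSH && p.has_bias) {
    return false;  // bias not supported with Q_K_V_BNSH inputs
  }
  return true;
}
```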
Kernel and Operator Documentation Updates

Updated `OperatorKernels.md` to document the new `Attention` operator inputs and outputs for both 3D and 4D formats, specifying the supported tensor types for each input.

Correctness and Consistency Fixes
Attention Parameter and Helper Refactoring
Added an `is_output_bnsh` field to `AttentionParameters` to indicate the output format, and updated the logic to use it for output placement and transposition decisions.

Introduced an `attention_helper` namespace for output mode enums and output shape computation, improving code clarity and maintainability; see the sketch below.
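As an illustrative sketch of what such a helper might look like, assuming a namespace shaped roughly like the PR's `attention_helper` (the enum values and function signature here are hypothetical, not the actual ONNX Runtime definitions):

```cpp
#include <cstdint>
#include <vector>

namespace attention_helper {

// Hypothetical output-mode enum: 3D packed output vs. 4D per-head output.
enum class OutputMode { BSH, BNSH };

// Computes the attention output shape for a given mode. In BSH mode the
// heads are packed back into the hidden dimension; in BNSH mode the head
// dimension stays explicit.
std::vector<int64_t> ComputeOutputShape(OutputMode mode, int64_t batch,
                                        int64_t seq_len, int64_t num_heads,
                                        int64_t head_size) {
  if (mode == OutputMode::BNSH) {
    return {batch, num_heads, seq_len, head_size};  // 4D output
  }
  return {batch, seq_len, num_heads * head_size};  // 3D packed output
}

}  // namespace attention_helper
```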
Minor Cleanups

These changes collectively improve the flexibility, correctness, and maintainability of the attention kernel implementations in ONNX Runtime, especially for advanced transformer models and large language model workloads.
NOT supported in this PR