Support group query attention in Attention(23) CUDA #27082
Open: Copilot wants to merge 24 commits into main from copilot/support-group-query-attention
Commits (24)
e538f4b  Initial plan (Copilot)
f2007a2  Add GQA support to Attention(23) CUDA operator (Copilot)
53c333f  Add debug tracking fields and num_splits parameter (Copilot)
042ff32  Fix code review issues: use v_head_size and parameters.softcap (Copilot)
0e7a632  Set softcap to 0.0f explicitly with comment (Copilot)
2e10874  Enable CUDA tests for GQA attention tests (Copilot)
b86acbd  Remove GQA test filters from disabled tests list (Copilot)
213a82d  Add float template instantiation for GQA QkvToContext (Copilot)
f79c509  Revert float support for GQA and add type validation (Copilot)
e52efb2  change gqa tests to fp16 (titaiwangms)
98c5dcf  examine gqa parameters and move down MHA parameters (titaiwangms)
0f800f5  Merge branch 'main' into copilot/support-group-query-attention (titaiwangms)
4978e96  support gqa bool masking (titaiwangms)
4c644e2  add flash/memory draft (titaiwangms)
f04b38e  Merge branch 'main' into copilot/support-group-query-attention (titaiwangms)
16d5453  finish gqa default (titaiwangms)
54d77ae  Apply suggestion from @titaiwangms (titaiwangms)
87a5648  introduce python attention tests for gqa (titaiwangms)
5981041  lint (titaiwangms)
6d7e50a  support attn_mask (titaiwangms)
d1cb063  Merge branch 'main' into copilot/support-group-query-attention (titaiwangms)
e2a4032  clean up and use ORT_MAKE_STATUS (titaiwangms)
dcb937a  Merge branch 'main' into copilot/support-group-query-attention (titaiwangms)
2509464  fix cpu bugs on fp16 (titaiwangms)
Conversations
TODO: Currently, we do not support 4D inputs of QKV.
Added exceptions.
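For illustration only, a minimal sketch of the kind of rank check that reply describes, not the PR's actual code: it assumes the CUDA Attention(23) GQA path expects 3D (batch_size, sequence_length, hidden_size) Q/K/V and uses onnxruntime's usual `ORT_MAKE_STATUS` / `TensorShape` utilities; the function name is hypothetical.

```cpp
// Hypothetical sketch of a 4D-input rejection, assuming onnxruntime's
// common headers; the validation actually added in this PR may differ.
#include "core/common/common.h"           // ORT_MAKE_STATUS
#include "core/framework/tensor_shape.h"  // TensorShape

namespace onnxruntime {
namespace cuda {

// Reject non-3D Q; the GQA path here assumes 3D
// (batch_size, sequence_length, hidden_size) inputs.
common::Status CheckQkvRank(const TensorShape& q_shape) {
  if (q_shape.NumDimensions() != 3) {
    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT,
                           "4D Q/K/V inputs are not supported yet; expected 3D "
                           "(batch_size, sequence_length, hidden_size), got rank ",
                           q_shape.NumDimensions());
  }
  return common::Status::OK();
}

}  // namespace cuda
}  // namespace onnxruntime
```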
Supporting this would require kernel changes in FlashAttention and EfficientAttention. If we want to support 4D inputs, the best approach would be another CUDA kernel that transposes/reshapes the input from 4D to 3D before feeding it to those two attention kernels.
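As a rough illustration of that idea (not code from this PR), the sketch below assumes the 4D input layout is (batch, num_heads, seq_len, head_size) (BNSH) and that the attention kernels consume a 3D (batch, seq_len, num_heads * head_size) view (flattened BSNH); the kernel and launcher names are hypothetical.

```cuda
// Minimal sketch: transpose BNSH (4D) into BSNH, which can then be viewed as
// 3D (B, S, N*H) by the downstream attention kernels. One thread per element.
#include <cstdint>
#include <cuda_runtime.h>

template <typename T>
__global__ void TransposeBNSHToBSNH(const T* input, T* output,
                                    int batch, int num_heads,
                                    int seq_len, int head_size) {
  const int64_t total =
      static_cast<int64_t>(batch) * num_heads * seq_len * head_size;
  const int64_t idx =
      static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx >= total) return;

  // Decompose the linear index according to the source BNSH layout:
  // idx = ((b * num_heads + n) * seq_len + s) * head_size + h
  const int64_t h = idx % head_size;
  const int64_t s = (idx / head_size) % seq_len;
  const int64_t n = (idx / (static_cast<int64_t>(head_size) * seq_len)) % num_heads;
  const int64_t b = idx / (static_cast<int64_t>(head_size) * seq_len * num_heads);

  // Recompose for the destination BSNH layout, viewed as 3D (B, S, N*H).
  const int64_t out_idx = ((b * seq_len + s) * num_heads + n) * head_size + h;
  output[out_idx] = input[idx];
}

// Host-side launcher; grid sized so every element is covered.
template <typename T>
void LaunchTransposeBNSHToBSNH(cudaStream_t stream, const T* input, T* output,
                               int batch, int num_heads, int seq_len, int head_size) {
  const int64_t total =
      static_cast<int64_t>(batch) * num_heads * seq_len * head_size;
  constexpr int kThreadsPerBlock = 256;
  const int blocks =
      static_cast<int>((total + kThreadsPerBlock - 1) / kThreadsPerBlock);
  TransposeBNSHToBSNH<T><<<blocks, kThreadsPerBlock, 0, stream>>>(
      input, output, batch, num_heads, seq_len, head_size);
}
```

The output buffer is contiguous in (B, S, N*H), so no further copy is needed before handing it to the 3D-only attention paths; the cost is one extra pass over Q, K, and V.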