
@juuso-oskari commented Oct 30, 2025

Authors: @Chi-Chu319 @juuso-oskari

This PR implements a unified attention kernel written in CK Tile. It builds on top of fmha_v3 (composable_kernel/example/ck_tile/01_fmha), with the pipeline remaining largely the same. The PR implements the following features introduced in the Triton unified attention kernel:

Reduced launch grid size, implemented in composable_kernel/example/ck_tile/01_unified_attention/unified_attention_impl.hpp:

// args.num_tokens is the cumulative amount of tokens from all sequences
index_t total_num_q_blocks = args.num_tokens / BLOCK_Q + args.num_seqs;
dim3 grids            = Kernel::GridSize2D(args.num_kv_heads, total_num_q_blocks);
return launch_kernel(config, make_kernel<kBlockPerCu>(Kernel{}, grids, blocks, 0, kargs));

This launches significantly fewer programs than the previous grid = (num_seqs, max_seqlen // BLOCK_M, num_q_heads), which contained many empty programs (not all sequences are of length max_seqlen).
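
For intuition, here is a small host-side sketch comparing the number of programs launched by the old per-sequence grid and the new flattened grid. The sequence lengths, head counts, and block sizes are purely illustrative (not taken from the PR); only the total_num_q_blocks formula matches the code above.

#include <algorithm>
#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical setup: 4 variable-length sequences, GQA with 32 query heads
    // sharing 8 KV heads (num_queries_per_kv = 4).
    std::vector<int> seq_lens = {1, 7, 300, 4096};
    const int BLOCK_M      = 64; // query-tile rows in the old grid
    const int BLOCK_Q      = 16; // query tokens per q-block in the new grid
    const int num_q_heads  = 32;
    const int num_kv_heads = 8;

    const int num_seqs = static_cast<int>(seq_lens.size());
    int num_tokens     = 0;
    for(int len : seq_lens)
        num_tokens += len; // 4404 in this example
    const int max_seqlen = *std::max_element(seq_lens.begin(), seq_lens.end());

    // Old launch: grid = (num_seqs, ceil(max_seqlen / BLOCK_M), num_q_heads);
    // short sequences still get ceil(max_seqlen / BLOCK_M) mostly-empty programs.
    const int old_programs = num_seqs * ((max_seqlen + BLOCK_M - 1) / BLOCK_M) * num_q_heads;

    // New launch: grid = (num_kv_heads, total_num_q_blocks) with
    // total_num_q_blocks = num_tokens / BLOCK_Q + num_seqs (one extra block per
    // sequence covers the remainder of that sequence's last partial block).
    const int total_num_q_blocks = num_tokens / BLOCK_Q + num_seqs;
    const int new_programs       = num_kv_heads * total_num_q_blocks;

    std::printf("old grid: %d programs, new grid: %d programs\n",
                old_programs, new_programs); // 8192 vs 2232 for this example
    return 0;
}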

However, since the current sequence index can no longer be read off the program id, we perform a binary search at the beginning of the kernel to find our sequence index (used to look up the sequence length, which determines the inner-loop length).

This is implemented at composable_kernel/include/ck_tile/ops/unified_attention/kernel/unified_attention_kernel.hpp:

// Binary search to find the sequence index for a given global index
CK_TILE_DEVICE static constexpr ck_tile::index_t
find_seq_idx(const int32_t* query_start_len_ptr,
             ck_tile::index_t target_idx,
             ck_tile::index_t num_seqs,
             ck_tile::index_t block_q,
             bool use_q_block_mode)
{
    ck_tile::index_t left = 0;
    ck_tile::index_t right = num_seqs;
    while (left < right)
    {
        ck_tile::index_t mid = (left + right) / 2;
        ck_tile::index_t val = query_start_len_ptr[mid];
        ck_tile::index_t mid_val = use_q_block_mode ? (val / block_q + mid) : val;
        
        if (mid_val <= target_idx)
        {
            left = mid + 1;
        }
        else
        {
            right = mid;
        }
    }
    return left - 1;
}
// usage inside the kernel
const auto [kv_head_idx, q_block_global_idx] = GetTileIndex(pid, kargs);
// grid size is (num_kv_heads, total_num_q_blocks)
// total_num_q_blocks = q.shape[0] // BLOCK_Q + num_seqs
// q.shape[0] is total number of query tokens across all batches
const index_t seq_idx = find_seq_idx(
    kargs.query_start_len_ptr, q_block_global_idx, kargs.num_seqs, BLOCK_Q, true
); // which seq am I
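
As a sanity check of the mapping, here is a host-side replica of find_seq_idx run on hypothetical values, assuming query_start_len_ptr holds the cumulative query start offset of each sequence (as the comments above suggest). Under that assumption each sequence i owns the q-block range beginning at query_start_len[i] / BLOCK_Q + i, and the search returns the last sequence whose range starts at or before the target block.

#include <cstdint>
#include <cstdio>
#include <vector>

// Host-side replica of the device binary search, for illustration only.
static int find_seq_idx(const int32_t* query_start_len_ptr,
                        int target_idx,
                        int num_seqs,
                        int block_q,
                        bool use_q_block_mode)
{
    int left = 0, right = num_seqs;
    while(left < right)
    {
        int mid     = (left + right) / 2;
        int val     = query_start_len_ptr[mid];
        int mid_val = use_q_block_mode ? (val / block_q + mid) : val;
        if(mid_val <= target_idx)
            left = mid + 1;
        else
            right = mid;
    }
    return left - 1;
}

int main()
{
    const int BLOCK_Q = 16;
    // Hypothetical cumulative query offsets for 3 sequences of lengths 1, 20, 40.
    std::vector<int32_t> query_start_len = {0, 1, 21, 61};
    const int num_seqs   = 3;
    const int num_tokens = 61;
    // Sequence 0 owns q-block 0, sequence 1 owns blocks 1-2, sequence 2 owns blocks 3-5;
    // total_num_q_blocks = 61 / 16 + 3 = 6.
    for(int q_block = 0; q_block < num_tokens / BLOCK_Q + num_seqs; ++q_block)
    {
        int seq = find_seq_idx(query_start_len.data(), q_block, num_seqs, BLOCK_Q, true);
        std::printf("q_block %d -> seq %d\n", q_block, seq);
    }
    return 0;
}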

To process more query rows per load in decode settings (where the query length per sequence is small, often just 1), we group query tokens along the query-head dimension: up to num_queries_per_kv query heads share the same key/value head (the GQA setting), so their rows can be packed into one tile. The total number of grouped rows per tile load is BLOCK_M = BLOCK_Q * num_queries_per_kv.

We do this in the kernel implementation by transforming the tensor view for Q in DRAM:

const auto q_dram = [&]() {
    const auto q_dram_base = make_naive_tensor_view<address_space_enum::global>(
        q_ptr,
        make_tuple(cur_batch_query_len, num_queries_per_kv, HEAD_SIZE),
        make_tuple(kargs.query_stride_0, kargs.query_stride_1, 1),
        number<UnifiedAttentionPipeline::kAlignmentQ>{},
        number<1>{});

    const auto q_dram_pad = pad_tensor_view( // align seqlen to BLOCK_Q and head dim to HEAD_SIZE_PADDED
        q_dram_base,
        // block sizes
        make_tuple(BLOCK_Q, 1, HEAD_SIZE_PADDED),
        sequence<true, false, kPadHeadDimQ>{}
    ); // pads to (query_len_padded, num_queries_per_kv, HEAD_SIZE_PADDED)

    const auto q_dram_merged = transform_tensor_view(
                q_dram_pad,
                make_tuple(
                    make_merge_transform(
                        make_tuple(query_len_padded, num_queries_per_kv)
                    ),
                    make_pass_through_transform(HEAD_SIZE_PADDED)
                ),
                make_tuple(sequence<0, 1>{}, sequence<2>{}),
                make_tuple(sequence<0>{}, sequence<1>{})
    ); // flattens the first two dims, head idx is the fastest changing dim in the merged dim
    return q_dram_merged;
}();
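
Since the merge puts the grouped query-head index in the fastest-changing position, row r of the resulting BLOCK_M-tall view corresponds to query token r / num_queries_per_kv and grouped head r % num_queries_per_kv. A tiny standalone sketch of that mapping, with illustrative sizes only:

#include <cstdio>

int main()
{
    // Illustrative sizes (not from the PR): BLOCK_Q = 4 query tokens grouped over
    // num_queries_per_kv = 4 query heads that share one KV head, so
    // BLOCK_M = BLOCK_Q * num_queries_per_kv = 16.
    const int num_queries_per_kv = 4;
    const int BLOCK_M            = 16;

    // Row r of the merged Q view corresponds to (query token, grouped head):
    for(int r = 0; r < BLOCK_M; ++r)
    {
        const int token = r / num_queries_per_kv; // slower-changing dim
        const int head  = r % num_queries_per_kv; // fastest-changing dim
        std::printf("merged row %2d -> token %d, head %d\n", r, token, head);
    }
    return 0;
}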

This way, the pipeline can remain untouched and use BLOCK_M as its tile size.

build

# in the root of ck_tile
mkdir build && cd build
# you can replace <arch> with the appropriate architecture (for example gfx90a or gfx942) or leave it blank
../script/cmake-ck-dev.sh  ../ <arch>
make tile_example_unified_attention -j1
