
Conversation


@z52527 z52527 commented Jan 6, 2026

Problem

Apply num_groups different linear transformations to the corresponding slices of the input:

Input:  x of shape (B * num_groups, input_dim)
Output: y of shape (B * num_groups, output_dim)

After viewing x and y as (B, num_groups, input_dim) and (B, num_groups, output_dim), each group n is transformed as y[b, n, :] = x[b, n, :] @ W[n, :, :], where W stacks one (input_dim, output_dim) matrix per group.

Reference Implementation

The straightforward approach uses a loop over groups:

x = x.reshape(B, num_groups, D_in)        # view flat (B * num_groups, D_in) input as 3D
x_split = torch.split(x, 1, dim=1)        # num_groups tensors of shape (B, 1, D_in)

out_list = []
for i in range(num_groups):
    x_i = x_split[i].squeeze(1)           # (B, D_in)
    out_i = linear_layers[i](x_i)         # (B, D_out), one linear layer per group
    out_list.append(out_i)

output = torch.stack(out_list, dim=1).reshape(-1, D_out)   # (B * num_groups, D_out)

Optimized Implementation

Use torch.bmm with strided output to fuse all GEMMs into one kernel:

x = x.reshape(B, num_groups, D_in)
output = torch.empty(B, num_groups, D_out,
                     dtype=x.dtype, device=x.device)   # pre-allocate final layout
torch.bmm(x.permute(1, 0, 2), weight,                  # weight: (num_groups, D_in, D_out)
          out=output.permute(1, 0, 2))                 # cuBLAS writes to strided memory
return output.view(-1, D_out)                          # O(1) view, no copy

Key feature: cuBLAS strided batched GEMM supports a strided output via its ldc/strideC parameters, so the kernel can write directly into the transposed memory layout with no extra transpose or copy.
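
A minimal, self-contained sketch that demonstrates the trick and checks it against a per-group matmul (shapes and variable names here are illustrative, not taken from the PR's module):

import torch

B, num_groups, D_in, D_out = 4, 3, 8, 16
x = torch.randn(B * num_groups, D_in)
weight = torch.randn(num_groups, D_in, D_out)   # one (D_in, D_out) matrix per group

# Optimized path: one batched GEMM writing into a permuted (strided) view.
x3 = x.reshape(B, num_groups, D_in)
output = torch.empty(B, num_groups, D_out, dtype=x.dtype)
torch.bmm(x3.permute(1, 0, 2), weight, out=output.permute(1, 0, 2))

# Reference path: per-group matmul.
expected = torch.stack([x3[:, n, :] @ weight[n] for n in range(num_groups)], dim=1)
assert torch.allclose(output, expected, atol=1e-5)

The permuted out view has shape (num_groups, B, D_out) with a non-unit batch stride, which is exactly the strided-C case the batched GEMM can handle.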

Performance Results

Config: batch_size=2560, num_groups=12, input_dim=1024, output_dim=3072, dtype=bf16
Device: NVIDIA H100

                      Speedup
Forward               1.46x
Forward + Backward    1.41x

Device: NVIDIA A100

                      Speedup    TFLOPS
Forward               1.67x      246.7
Forward + Backward    1.34x      238.0

@z52527 z52527 self-assigned this Jan 6, 2026

JacoCheung commented Jan 7, 2026

@z52527,
Could you generalize BmmImpl so that it can handle activations laid out as either [batch_count, batch_size, input_dim] or [batch_size, batch_count, input_dim]? Even though the input arrives flattened as [batch_count*batch_size, input_dim], your implementation assumes it is [batch_size, batch_count, input_dim].
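
A minimal sketch of what such a generalization might look like (illustrative only; the grouped_linear name and groups_first flag are assumptions, not the PR's actual BmmImpl API):

import torch

def grouped_linear(x, weight, groups_first: bool):
    """x: flat activation of shape (batch_count * batch_size, D_in),
    weight: (batch_count, D_in, D_out).
    groups_first=True  -> flat layout is [batch_count, batch_size, D_in]
    groups_first=False -> flat layout is [batch_size, batch_count, D_in]
    """
    num_groups, D_in, D_out = weight.shape
    if groups_first:
        # Groups are already the leading dimension: a plain contiguous bmm suffices.
        x3 = x.reshape(num_groups, -1, D_in)                  # (G, B, D_in)
        out = torch.bmm(x3, weight)                           # (G, B, D_out)
        return out.reshape(-1, D_out)
    else:
        # Groups are interleaved along dim 1: run bmm on a permuted view and
        # write into a strided out tensor so no transpose copy is needed.
        x3 = x.reshape(-1, num_groups, D_in)                  # (B, G, D_in)
        out = torch.empty(x3.shape[0], num_groups, D_out,
                          dtype=x.dtype, device=x.device)
        torch.bmm(x3.permute(1, 0, 2), weight, out=out.permute(1, 0, 2))
        return out.reshape(-1, D_out)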
