Examples of training bias #84

drisspg · 2024-11-22T22:06:42Z

Stacked PRs:

->Examples of training bias #84

Examples of training bias

stack-info: PR: #84, branch: drisspg/stack/2

# Summary

The follow-up PR to: #137526. In this PR, we update the lowerings for the flex_attention backward kernel to generate fused backward gradient calculations for any captured buffers that require grads. We do this by using tl.atomic_add to scatter the correct gradients into a zeroed-out buffer for each captured buffer that requires grad.

Added many test cases and, along the way, found some masking bugs.

There are likely some performance cliffs here, specifically with dtypes and on different GPUs. I plan to profile the current strategy in a follow-up.

We are explicitly choosing reduced memory over increased performance right now. By using atomics, we do not need to realize a full attention scores matrix. However, this comes with two downsides. One, it is potentially slower in some cases, and two, the gradient calculation for any captured buffers is non-deterministic.

## Worked Example

Let's do the case where you are reading from one bias that doesn't require grad and using it to index into another that does.

ScoreMod:

```Python
bias = torch.randn(
    params.seq_length,
    device=self.device,
    dtype=params.dtype,
    requires_grad=True,
)

offset = torch.randint(
    0,
    params.seq_length,
    (params.seq_length,),
    device=self.device,
)

def score_mod(score, b, h, q_idx, kv_idx):
    return score + bias[offset[q_idx]]
```

I am removing all but the new subgraph injected into the backwards:

```Python
dsT = pT * (dpT - Di[None, :])
# ~~~~~~~~~~~~~~~~~~~ Apply joint modification ~~~~~~~~~~~~~~~~~~~
grad_scores = (dsT)

# ~~~~~~~~~~~~~~~~~~~ Apply other buffer grad writes ~~~~~~~~~~~~~
idx_b = off_z
idx_h = off_hq
idx_m = m
idx_n = n
scatter_mask = offs_m1[None, :] < Q_LEN and offs_n1[:, None] < KV_LEN
tmp4 = (dsT).to(tl.float32)
tl.atomic_add(out_ptr1 + (tl.broadcast_to(tl.load(in_ptr16 + idx_m), tmp4.shape)), tmp4, scatter_mask, sem='relaxed')
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

## Key points

* We always accumulate into float32 grad buffers regardless of the dtype used in the forward. This is because we normally do all computation inside the kernel with fp32 accumulation, and we want the same behavior for atomic additions.
* We are currently restricted to one scatter in the kernel. I have some ideas for fx rewrites that would remove this restriction, but for now there is a clear error message with a workaround, and I will leave this as a follow-up.
* Will do more extensive performance/memory profiling in a follow-up.

### Toy E2E example

I have a toy E2E training example PR in the gym for now: pytorch-labs/attention-gym#84

I plan to update it to a realistic learnable bias before landing.

Pull Request resolved: #137452
Approved by: https://github.com/Chillee
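For readers who want to see what the scatter in the worked example computes, here is a minimal, dependency-free sketch. It is not the generated kernel: it drops the batch and head dimensions, uses plain Python lists instead of tensors, and replaces `tl.atomic_add` with a sequential loop. The names `scatter_bias_grad`, `reference_bias_grad`, `grad_scores`, `q_len`, and `kv_len` are hypothetical and chosen only for illustration.

```python
# Sketch of the gradient scatter for
#   score_mod(score, b, h, q_idx, kv_idx) = score + bias[offset[q_idx]]
# Assumption: a single batch and head, plain Python floats, sequential adds.
import random


def scatter_bias_grad(grad_scores, offset, bias_len):
    """Accumulate d(loss)/d(bias) from per-score gradients.

    grad_scores[q][kv] is the gradient flowing into each modified score.
    Every (q, kv) entry read bias[offset[q]] in the forward pass, so its
    gradient is added into that slot of a zero-initialized buffer.
    """
    grad_bias = [0.0] * bias_len  # zeroed buffer that receives the scatter
    for q, row in enumerate(grad_scores):
        for g in row:
            grad_bias[offset[q]] += g  # sequential stand-in for an atomic add
    return grad_bias


def reference_bias_grad(grad_scores, offset, bias_len):
    """Same gradient from the definition: sum the rows whose offset hits slot j."""
    return [
        sum(sum(row) for q, row in enumerate(grad_scores) if offset[q] == j)
        for j in range(bias_len)
    ]


random.seed(0)
q_len, kv_len = 8, 4
offset = [random.randrange(q_len) for _ in range(q_len)]
grad_scores = [
    [random.uniform(-1.0, 1.0) for _ in range(kv_len)] for _ in range(q_len)
]

grad_bias = scatter_bias_grad(grad_scores, offset, q_len)
expected = reference_bias_grad(grad_scores, offset, q_len)

# The two versions add the same numbers in a different order, so compare
# with a tolerance rather than for exact equality.
assert all(abs(a - b) < 1e-9 for a, b in zip(grad_bias, expected))
```

The tolerance in the final check reflects the same effect the summary describes: floating-point addition is not associative, so when the order of the adds is not fixed (as with atomics on a GPU) the result can differ in the last bits from run to run.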

drisspg added a commit that referenced this pull request Nov 22, 2024

Examples of training bias

0166927

stack-info: PR: #84, branch: drisspg/stack/2

drisspg force-pushed the drisspg/stack/2 branch from d2864f6 to 0166927 on November 22, 2024 22:06

facebook-github-bot added the CLA Signed label Nov 22, 2024

Examples of training bias

4d045e1

stack-info: PR: #84, branch: drisspg/stack/2

drisspg force-pushed the drisspg/stack/2 branch from 0166927 to 4d045e1 on November 22, 2024 22:07

drisspg mentioned this pull request Nov 23, 2024

[FlexAttention] add support for learnable biases in Inductor pytorch/pytorch#137452

Closed
