support for llama multipack using updated code/patches #1754
The attention monkey patch we have for llama is pretty old at this point, and maintaining it is a pain. This swaps to the updated unpad patch for flash attention, with a slight refactor to continue supporting the cross entropy loss and RMS norm patches.
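For reference, here's a minimal sketch of the unpad pattern this refers to, not the actual diff from this PR: `flash_attn_unpadded` is a hypothetical wrapper name, while `unpad_input`, `pad_input`, and `flash_attn_varlen_func` come from the flash-attn package.

```python
from flash_attn import flash_attn_varlen_func
from flash_attn.bert_padding import pad_input, unpad_input


def flash_attn_unpadded(q, k, v, attention_mask):
    # q, k, v: (batch, seqlen, nheads, headdim); attention_mask: (batch, seqlen)
    batch, seqlen, _, _ = q.shape

    # Strip padded tokens once up front. cu_seqlens records each sequence's
    # boundary offsets; multipack builds on this by making boundaries fall
    # between packed samples, so the varlen kernel never attends across them.
    # Slice to 4 values because newer flash-attn releases return an extra one.
    q_unpad, indices, cu_seqlens, max_seqlen = unpad_input(q, attention_mask)[:4]
    k_unpad = unpad_input(k, attention_mask)[0]
    v_unpad = unpad_input(v, attention_mask)[0]

    out_unpad = flash_attn_varlen_func(
        q_unpad, k_unpad, v_unpad,
        cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
        max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
        causal=True,
    )

    # Scatter the results back into the padded (batch, seqlen, ...) layout.
    return pad_input(out_unpad, indices, batch, seqlen)
```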
As we can see, it's slightly faster, uses about 1.4% less VRAM, and has pretty similar loss and grad norm characteristics. I also attempted to use the updated Triton RMS norm in place of flash-attn's CUDA RMS norm implementation, but that made things slightly worse.
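For context, a hedged sketch of what the loss and norm patches can look like; the class and helper names below are illustrative rather than the PR's, and the exact transformers/flash-attn internals vary by version (the Triton variant mentioned above is exposed under `flash_attn.ops.triton.layer_norm` in recent releases).

```python
from functools import partial

import transformers
from flash_attn.losses.cross_entropy import CrossEntropyLoss
from flash_attn.ops.rms_norm import RMSNorm


class FlashLlamaRMSNorm(RMSNorm):
    # transformers constructs LlamaRMSNorm(hidden_size, eps=...), so mirror
    # that signature and forward to flash-attn's fused CUDA kernel.
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__(hidden_size, eps=eps)


def apply_loss_and_norm_patches():
    # Hypothetical helper: swap the module-level symbols that modeling_llama
    # resolves at call time. Must run before the model is instantiated.
    llama = transformers.models.llama.modeling_llama
    # inplace_backward overwrites the logits gradient in place to save memory.
    llama.CrossEntropyLoss = partial(CrossEntropyLoss, inplace_backward=True)
    llama.LlamaRMSNorm = FlashLlamaRMSNorm
```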