Sequence Parallelism, memory usage question #3803
EthanChen1234
started this conversation in
Community | General
Replies: 1 comment 2 replies
-
I'm still waiting.
-
In the paper "Sequence Parallelism: Long Sequence Training from System Perspective", under tensor parallelism both the activations and the trainable weights consume memory.
The MLP activations account for 4BLH/N plus BLH, and the trainable weights should account for 4H^2/N + 4H^2/N = 8H^2/N.
I'm confused about where the 32H^2/N memory figure comes from.
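For reference, here is a small sketch of the arithmetic in question. The factor of 4 (weights, gradients, and Adam's two moment buffers) is my own guess at how 8H^2/N could become 32H^2/N; it is an assumption, not something stated in the paper excerpt above.

```python
# MLP memory arithmetic under tensor parallelism (element counts, not bytes).
# Assumption (mine, not from the paper): the 32H^2/N figure may count
# weights + gradients + Adam's two moment buffers = 4 copies of the
# 8H^2/N parameter count.

def mlp_weight_elems(H: int, N: int) -> int:
    """Two MLP matrices, H x 4H and 4H x H, each split across N devices."""
    return 4 * H * H // N + 4 * H * H // N  # = 8H^2/N

def mlp_train_state_elems(H: int, N: int, copies: int = 4) -> int:
    """copies=4: weights, gradients, Adam first and second moments."""
    return copies * mlp_weight_elems(H, N)

H, N = 1024, 4
print(mlp_weight_elems(H, N))       # 8H^2/N  = 2097152
print(mlp_train_state_elems(H, N))  # 32H^2/N = 8388608
```

Under this reading, 32H^2/N = 4 x 8H^2/N; if the paper counts only the raw weights, the discrepancy would have to come from somewhere else.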