Hi Expert,
Currently, if name_model = "70B" is configured and torchrun is launched in the prime framework on a single server with 8 GPUs, the launch on peer2 fails because a random rank fails. Sometimes it reports CUDA out of memory; sometimes it fails without any specific reason.
Is this expected? And what configuration and model parameters can be used for larger models (for example, 70B)?
Thanks!!
Regards,
Kun