
Could prime support up to 70B llama3 model? #203

Open
lizdongkun opened this issue Jan 19, 2025 · 0 comments
lizdongkun commented Jan 19, 2025

Hi experts,

Currently, if name_model = "70B" is configured and torchrun is launched in the prime framework on a single server with 8 GPUs, the launch on peer2 fails with a seemingly random rank failure: sometimes it reports CUDA out of memory, and sometimes it fails without any specific reason.
Is this expected? And what configuration and model parameters should be used for a larger model (for example, 70B)?
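For reference, the single-server launch looks roughly like this (a minimal sketch: `--nproc_per_node=8` matches the 8-GPU server, while the training entrypoint path and the `configs/70B.toml` file with its contents are assumptions modeled on the repo's example configs):

```sh
# Hypothetical config (keys mirror the example TOML configs in this repo):
#   name_model = "70B"
# Launch one peer on a single server with 8 GPUs; peer2 is launched the
# same way on its own node and is the one that crashes:
torchrun --nproc_per_node=8 src/zeroband/train.py @configs/70B.toml
```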

Thanks!!

Regards,
Kun
