Hi Experts,

Currently, if name_model = "70B" is configured and torchrun in the prime framework is launched on a single server with 8 GPUs, the launch on peer2 fails because a random rank crashes. Sometimes it reports a CUDA out-of-memory error; sometimes it fails without any specific reason.

Is this expected? And what configuration and model parameters could be used for a larger model (for example, 70B)?

Thanks!

Regards,
Kun
I'm not 100% sure, but I believe prime uses a hybrid sharding policy for FSDP. That means you can only train a model that fits on a single node. In my experience, the practical limit is roughly 10B parameters before you hit OOM on a node with 8× 80GB A100/H100 GPUs. So even if you add more nodes, it will not work. (Please correct me if I'm wrong.)
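A back-of-envelope calculation is consistent with that limit. With hybrid sharding, parameters and optimizer state are sharded only across the GPUs of one node, not the whole cluster. The byte counts below are assumptions (bf16 parameters plus fp32 master weights and fp32 Adam moments, activations not counted), so treat this as a rough sketch rather than an exact profile of prime:

```python
# Rough per-GPU memory for model + optimizer state under hybrid sharding,
# where state is sharded only across the GPUs within one node.
GIB = 1024**3

def per_gpu_state_gib(n_params: float, gpus_per_node: int = 8) -> float:
    # Assumed breakdown: 2 B bf16 params + 4 B fp32 master weights
    # + 4 B + 4 B fp32 Adam moments = 14 bytes per parameter.
    bytes_per_param = 2 + 4 + 4 + 4
    return n_params * bytes_per_param / gpus_per_node / GIB

print(f"70B: {per_gpu_state_gib(70e9):.1f} GiB per GPU")  # → 70B: 114.1 GiB per GPU
print(f"10B: {per_gpu_state_gib(10e9):.1f} GiB per GPU")  # → 10B: 16.3 GiB per GPU
```

Under these assumptions, a 70B model needs well over 80 GiB per GPU before activations are even counted, while a ~10B model fits with headroom, matching the OOM behavior described above.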