Unable to train the algorithms with 2 GPU instances (multi-node), each with 4 A100s #272

CheruscanArminius · 2024-01-05T19:41:57Z

The paper says the algorithm was trained with 8 A100 GPUs.
I have two instances, each equipped with 4 A100s, instead of a single instance with 8 A100 GPUs.
Is there any way to specify multiple instances in the configuration? In other words, where can I set the number of nodes in the code?
https://lightning.ai/docs/pytorch/stable/common/trainer.html#num-nodes
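
For reference, the `num_nodes` setting from the linked Lightning documentation could be applied along these lines. This is only a minimal sketch, assuming the training script constructs a Lightning `Trainer` and that Lightning 2.x is installed (`lightning.pytorch`; older releases use the `pytorch_lightning` package instead). The helper names `multi_node_trainer_kwargs`, `world_size`, and `build_trainer` are hypothetical and do not come from this repository.

```python
from typing import Any, Dict


def multi_node_trainer_kwargs(num_nodes: int = 2, devices_per_node: int = 4) -> Dict[str, Any]:
    """Return Trainer keyword arguments for a multi-node DDP run (hypothetical helper)."""
    if num_nodes < 1 or devices_per_node < 1:
        raise ValueError("num_nodes and devices_per_node must be positive")
    return {
        "accelerator": "gpu",
        "devices": devices_per_node,  # GPUs per node, not the total across nodes
        "num_nodes": num_nodes,       # number of machines taking part in training
        "strategy": "ddp",            # distributed data parallel
    }


def world_size(kwargs: Dict[str, Any]) -> int:
    """Total number of training processes: one per GPU on every node."""
    return kwargs["num_nodes"] * kwargs["devices"]


def build_trainer(**overrides: Any):
    """Create a Trainer for 2 nodes x 4 GPUs; extra keyword arguments override the defaults."""
    # Imported lazily so the helpers above can be used without Lightning installed.
    import lightning.pytorch as pl

    return pl.Trainer(**{**multi_node_trainer_kwargs(), **overrides})
```

With these defaults, `world_size(multi_node_trainer_kwargs())` is 8, which matches the 8 A100s mentioned in the paper. The same script still has to be launched on both instances, with the cluster environment (for example `MASTER_ADDR`, `MASTER_PORT`, and `NODE_RANK`) set so that the nodes can find each other.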

I would appreciate any comments on this.

Update:

I added a number-of-nodes option to the training process and opened a pull request. If it is accepted, this issue will be closed.

CheruscanArminius changed the title ~~Unable to train the algorithms with 2 GPU instances, each with 4 A100s~~ Unable to train the algorithms with 2 GPU instances (multi-node), each with 4 A100s Jan 5, 2024
