Implementation of the decentralized learning algorithm RelaySGD inside Bagua.
You can run the benchmark with an installed version of `bagua`:

```bash
python3 -m bagua.distributed.launch --nproc_per_node=<number of gpus> benchmark.py --algorithm relay
```
You can also pass additional parameters:

```bash
python3 -m bagua.distributed.launch --nproc_per_node=<number of gpus> benchmark.py --algorithm relay --lr <learning rate> --alpha <data heterogeneity parameter> --topology <relay topology, e.g. chain>
```
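For instance, a run on 8 GPUs with a chain topology might look like this; the learning rate and alpha values here are illustrative placeholders, not tuned settings:

```bash
python3 -m bagua.distributed.launch --nproc_per_node=8 benchmark.py \
    --algorithm relay \
    --lr 0.1 \
    --alpha 0.1 \
    --topology chain
```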
The `logs` folder contains the output of all runs.
To tune the hyperparameters, modify and run the scripts `hpt_relay.sh` and `hpt_rest.sh`. Their output is saved in the `logs` folder as `summary*.txt`.
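A minimal tuning session might look like the following sketch, assuming the scripts are run with bash from the repository root:

```bash
# Run the hyperparameter sweeps for RelaySGD and for the remaining algorithms.
bash hpt_relay.sh
bash hpt_rest.sh

# Inspect the resulting summaries.
cat logs/summary*.txt
```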
The `final_run.sh` script executes the experiment shown below on 8 GPUs using the best learning rates.
The second experiment evaluates the throughput of the different algorithms (`synth_benchmark_run.sh`).
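Both experiments can then be launched directly, again assuming bash and the repository root as the working directory:

```bash
# Final accuracy experiment on 8 GPUs with the tuned learning rates.
bash final_run.sh

# Throughput comparison across algorithms.
bash synth_benchmark_run.sh
```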