[2.5] Fix TF examples (NVIDIA#3038)
* Fix TF examples

* Fix link

* Update some texts

* Undo changes
YuanTingHsieh committed Nov 27, 2024
1 parent da6f808 commit bad662d
Showing 5 changed files with 23 additions and 72 deletions.
15 changes: 5 additions & 10 deletions examples/advanced/job_api/tf/README.md
@@ -7,9 +7,8 @@ All examples in this folder are based on using [TensorFlow](https://tensorflow.o

## Simulated Federated Learning with CIFAR10 Using Tensorflow

-This example shows `Tensorflow`-based classic Federated Learning
-algorithms, namely FedAvg and FedOpt on CIFAR10
-dataset. This example is analogous to [the example using `Pytorch`
+This example demonstrates TensorFlow-based federated learning algorithms on the CIFAR-10 dataset.
+This example is analogous to [the example using `Pytorch`
backend](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/cifar10/cifar10-sim)
on the same dataset, where same experiments
were conducted and analyzed. You should expect the same
@@ -21,7 +20,7 @@ client-side training logics (details in file
and the new
[`FedJob`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/job_config/api.py)
APIs were used to programmatically set up an
-`nvflare` job to be exported or ran by simulator (details in file
+NVFlare job to be exported or ran by simulator (details in file
[`tf_fl_script_runner_cifar10.py`](tf_fl_script_runner_cifar10.py)),
alleviating the need of writing job config files, simplifying
development process.
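
To make the FedJob workflow concrete, the following minimal sketch shows what such a job script can look like. It is illustrative only, assuming the `FedJob`, `FedAvg`, and `ScriptRunner` APIs of NVFlare 2.5; the actual `tf_fl_script_runner_cifar10.py` configures additional details (initial TF model, per-algorithm options, experiment tracking).

```python
# Minimal FedJob-style job script (illustrative sketch, not the real
# tf_fl_script_runner_cifar10.py).
from nvflare.app_common.workflows.fedavg import FedAvg
from nvflare.job_config.api import FedJob
from nvflare.job_config.script_runner import ScriptRunner

n_clients = 8
num_rounds = 50

job = FedJob(name="cifar10_tf_fedavg")

# Server side: a FedAvg controller drives the federated rounds.
job.to(FedAvg(num_clients=n_clients, num_rounds=num_rounds), "server")

# Client side: each site runs the Client API training script.
for i in range(n_clients):
    runner = ScriptRunner(
        script="src/cifar10_tf_fl_alpha_split.py",
        script_args="--batch_size 64 --epochs 4",
    )
    job.to(runner, f"site-{i + 1}")

# Export a job config folder, or run the job directly in the simulator.
job.export_job("/tmp/nvflare/jobs/job_config")
job.simulator_run("/tmp/nvflare/jobs/workdir", gpu="0")
```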
@@ -65,12 +64,8 @@ script.
> `export TF_FORCE_GPU_ALLOW_GROWTH=true && export
> TF_GPU_ALLOCATOR=cuda_malloc_async`

-The set-up of all experiments in this example are kept the same as
-[the example using `Pytorch`
-backend](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/cifar10/cifar10-sim). Refer
-to the `Pytorch` example for more details. Similar to the Pytorch
-example, we here also use Dirichelet sampling on CIFAR10 data labels
-to simulate data heterogeneity among data splits for different client
+We use Dirichelet sampling (implementation from FedMA (https://github.com/IBM/FedMA)) on
+CIFAR10 data labels to simulate data heterogeneity among data splits for different client
sites, controlled by an alpha value, ranging from 0 (not including 0)
to 1. A high alpha value indicates less data heterogeneity, i.e., an
alpha value equal to 1.0 would result in homogeneous data distribution
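
For intuition, below is a generic sketch of Dirichlet label splitting; it is a simplified stand-in, not the FedMA implementation the example actually uses.

```python
# Generic Dirichlet label split (simplified; not the FedMA implementation):
# each class's samples are divided among clients according to proportions
# drawn from Dirichlet(alpha), so a smaller alpha gives more skewed splits.
import numpy as np


def dirichlet_split(labels: np.ndarray, n_clients: int, alpha: float, seed: int = 0):
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(n_clients)]
    for cls in np.unique(labels):
        cls_idx = np.where(labels == cls)[0]
        rng.shuffle(cls_idx)
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        # Cut points that divide this class's samples by the drawn proportions.
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client_id, part in enumerate(np.split(cls_idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices


# Example: 8 clients with a highly heterogeneous split (alpha=0.1).
labels = np.random.randint(0, 10, size=50_000)  # stand-in for CIFAR-10 labels
splits = dirichlet_split(labels, n_clients=8, alpha=0.1)
print([len(s) for s in splits])
```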
12 changes: 6 additions & 6 deletions examples/advanced/job_api/tf/run_jobs.sh
@@ -25,7 +25,7 @@ GPU_INDX=0
WORKSPACE=/tmp

# Run centralized training job
-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo centralized \
--n_clients 1 \
--num_rounds 25 \
@@ -39,7 +39,7 @@ python ./tf_fl_script_executor_cifar10.py \
# Run FedAvg with different alpha values
for alpha in 1.0 0.5 0.3 0.1; do

-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo fedavg \
--n_clients 8 \
--num_rounds 50 \
@@ -53,7 +53,7 @@ done


# Run FedOpt job
-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo fedopt \
--n_clients 8 \
--num_rounds 50 \
@@ -65,7 +65,7 @@ python ./tf_fl_script_executor_cifar10.py \


# Run FedProx job.
-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo fedprox \
--n_clients 8 \
--num_rounds 50 \
@@ -77,11 +77,11 @@ python ./tf_fl_script_executor_cifar10.py \


# Run scaffold job
-python ./tf_fl_script_executor_cifar10.py \
+python ./tf_fl_script_runner_cifar10.py \
--algo scaffold \
--n_clients 8 \
--num_rounds 50 \
--batch_size 64 \
--epochs 4 \
--alpha 0.1 \
--gpu $GPU_INDX
29 changes: 10 additions & 19 deletions examples/getting_started/tf/README.md
@@ -1,26 +1,21 @@
# Getting Started with NVFlare (TensorFlow)
[![TensorFlow Logo](https://upload.wikimedia.org/wikipedia/commons/a/ab/TensorFlow_logo.svg)](https://tensorflow.org/)

-We provide several examples to quickly get you started using NVFlare's Job API.
+We provide several examples to help you quickly get started with NVFlare.
All examples in this folder are based on using [TensorFlow](https://tensorflow.org/) as the model training framework.

## Simulated Federated Learning with CIFAR10 Using Tensorflow

-This example shows `Tensorflow`-based classic Federated Learning
-algorithms, namely FedAvg and FedOpt on CIFAR10
-dataset. This example is analogous to [the example using `Pytorch`
-backend](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/cifar10/cifar10-sim)
-on the same dataset, where same experiments
-were conducted and analyzed. You should expect the same
-experimental results when comparing this example with the `Pytorch` one.
+This example demonstrates TensorFlow-based federated learning algorithms,
+FedAvg and FedOpt, on the CIFAR-10 dataset.

In this example, the latest Client APIs were used to implement
client-side training logics (details in file
[`cifar10_tf_fl_alpha_split.py`](src/cifar10_tf_fl_alpha_split.py)),
and the new
[`FedJob`](https://github.com/NVIDIA/NVFlare/blob/main/nvflare/job_config/api.py)
APIs were used to programmatically set up an
-`nvflare` job to be exported or ran by simulator (details in file
+NVFlare job to be exported or ran by simulator (details in file
[`tf_fl_script_runner_cifar10.py`](tf_fl_script_runner_cifar10.py)),
alleviating the need of writing job config files, simplifying
development process.
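
For reference, the Client API pattern used by the training script boils down to a receive/train/send loop. The sketch below is a simplified illustration with a placeholder model and dummy data, not the actual `cifar10_tf_fl_alpha_split.py`.

```python
# Simplified Client API loop (illustrative; the real cifar10_tf_fl_alpha_split.py
# adds the CIFAR-10 pipeline, alpha-based data splits, and TensorBoard logging).
import nvflare.client as flare
import tensorflow as tf


def build_model() -> tf.keras.Model:
    # Placeholder stand-in for the example's CNN.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model


model = build_model()
flare.init()  # register this process with the NVFlare client runtime

while flare.is_running():
    input_model = flare.receive()  # global weights sent by the server
    model.set_weights(list(input_model.params.values()))

    # Local training; dummy data here instead of this site's CIFAR-10 split.
    x = tf.random.uniform((64, 32, 32, 3))
    y = tf.random.uniform((64,), maxval=10, dtype=tf.int32)
    history = model.fit(x, y, epochs=1, verbose=0)

    flare.send(
        flare.FLModel(
            params={str(i): w for i, w in enumerate(model.get_weights())},
            metrics={"accuracy": float(history.history["accuracy"][-1])},
        )
    )
```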
@@ -64,12 +59,8 @@ script.
> `export TF_FORCE_GPU_ALLOW_GROWTH=true && export
> TF_GPU_ALLOCATOR=cuda_malloc_async`

-The set-up of all experiments in this example are kept the same as
-[the example using `Pytorch`
-backend](https://github.com/NVIDIA/NVFlare/tree/main/examples/advanced/cifar10/cifar10-sim). Refer
-to the `Pytorch` example for more details. Similar to the Pytorch
-example, we here also use Dirichelet sampling on CIFAR10 data labels
-to simulate data heterogeneity among data splits for different client
+We use Dirichelet sampling (implementation from FedMA (https://github.com/IBM/FedMA)) on
+CIFAR10 data labels to simulate data heterogeneity among data splits for different client
sites, controlled by an alpha value, ranging from 0 (not including 0)
to 1. A high alpha value indicates less data heterogeneity, i.e., an
alpha value equal to 1.0 would result in homogeneous data distribution
@@ -111,11 +102,11 @@ for alpha in 1.0 0.5 0.3 0.1; do
done
```

-## 2. Results
+## 3. Results

Now let's compare experimental results.

-### 2.1 Centralized training vs. FedAvg for homogeneous split
+### 3.1 Centralized training vs. FedAvg for homogeneous split
Let's first compare FedAvg with homogeneous data split
(i.e. `alpha=1.0`) and centralized training. As can be seen from the
figure and table below, FedAvg can achieve similar performance to
@@ -129,7 +120,7 @@ no difference in data distributions among different clients.

![Central vs. FedAvg](./figs/fedavg-vs-centralized.png)

-### 2.2 Impact of client data heterogeneity
+### 3.2 Impact of client data heterogeneity

Here we compare the impact of data heterogeneity by varying the
`alpha` value, where lower values cause higher heterogeneity. As can
@@ -145,7 +136,7 @@ as data heterogeneity becomes higher.

![Impact of client data
heterogeneity](./figs/fedavg-diff-alphas.png)

> [!NOTE]
> More examples can be found at https://nvidia.github.io/NVFlare.
35 changes: 0 additions & 35 deletions examples/getting_started/tf/run_jobs.sh
@@ -50,38 +50,3 @@ for alpha in 1.0 0.5 0.3 0.1; do
--workspace $WORKSPACE

done


# Run FedOpt job
python ./tf_fl_script_runner_cifar10.py \
--algo fedopt \
--n_clients 8 \
--num_rounds 50 \
--batch_size 64 \
--epochs 4 \
--alpha 0.1 \
--gpu $GPU_INDX \
--workspace $WORKSPACE


# Run FedProx job.
python ./tf_fl_script_runner_cifar10.py \
--algo fedprox \
--n_clients 8 \
--num_rounds 50 \
--batch_size 64 \
--epochs 4 \
--fedprox_mu 1e-5 \
--alpha 0.1 \
--gpu $GPU_INDX


# Run scaffold job
python ./tf_fl_script_runner_cifar10.py \
--algo scaffold \
--n_clients 8 \
--num_rounds 50 \
--batch_size 64 \
--epochs 4 \
--alpha 0.1 \
--gpu $GPU_INDX
4 changes: 2 additions & 2 deletions examples/hello-world/hello-tf/README.md
@@ -4,7 +4,7 @@ Example of using [NVIDIA FLARE](https://nvflare.readthedocs.io/en/main/index.htm
using federated averaging ([FedAvg](https://arxiv.org/abs/1602.05629))
and [TensorFlow](https://tensorflow.org/) as the deep learning training framework.

-> **_NOTE:_** This example uses the [MNIST](http://yann.lecun.com/exdb/mnist/) handwritten digits dataset and will load its data within the trainer code.
+> **_NOTE:_** This example uses the [MNIST](https://www.tensorflow.org/datasets/catalog/mnist) handwritten digits dataset and will load its data within the trainer code.
See the [Hello TensorFlow](https://nvflare.readthedocs.io/en/main/examples/hello_tf_job_api.html#hello-tf-job-api) example documentation page for details on this
example.
@@ -48,7 +48,7 @@ In scenarios where multiple clients are involved, you have to prevent TensorFlow
by setting the following flags.

```bash
-TF_FORCE_GPU_ALLOW_GROWTH=true TF_GPU_ALLOCATOR=cuda_malloc_async
+TF_FORCE_GPU_ALLOW_GROWTH=true TF_GPU_ALLOCATOR=cuda_malloc_async python3 fedavg_script_runner_tf.py
```
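
Assuming the flags only need to be in the process environment before TensorFlow initializes the GPU, they can also be set from Python at the top of the job script so that launched client processes inherit them; this is an illustrative alternative to the shell prefix above.

```python
import os

# Set the flags before TensorFlow creates its GPU context; placing this at the
# very top of the job script (e.g. fedavg_script_runner_tf.py) lets the client
# processes it launches inherit the same environment.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"

import tensorflow as tf  # import only after the flags are in place
```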

If you possess more GPUs than clients, a good strategy is to run one client on each GPU.
