added how to contrib code to NVFlare
dirkpetersen committed Jun 16, 2024
1 parent f639259 commit 53804e4
Showing 4 changed files with 110 additions and 78 deletions.
82 changes: 70 additions & 12 deletions README.md
Expand Up @@ -26,6 +26,7 @@ The central NVFlare dashboard and server was installed by the `Project Admin`, t
- [Pytorch poc mode](#pytorch-poc-mode)
- [Troubleshooting](#troubleshooting)
- [missing server dependencies](#missing-server-dependencies)
- [No default VPC](#no-default-vpc)
- [SSL issue](#ssl-issue)
- [Using NVFlare as an Org Admin](#using-nvflare-as-an-org-admin)
- [Register client sites](#register-client-sites)
Expand All @@ -48,7 +49,7 @@ The central NVFlare dashboard and server was installed by the `Project Admin`, t
- [Resetting Dashboard](#resetting-dashboard)
- [Installing Server](#installing-server)
- [Installing Client](#installing-client)

- [Contributing code to NVFlare](#contributing-code-to-nvflare)

# Installing NVFlare deploy environment

Expand Down Expand Up @@ -346,6 +347,10 @@ options:

If you get an `Error 113` in the server log, a dependency on the server might be missing. For example, the NVFlare hello-pt example requires PyTorch not only on the clients but also on the server. To confirm the root cause, use the FLARE console (admin CLI) to log in and execute the command `download_job [job-id]` to get the entire workspace folder. You will find it in the transfer folder of the console. Please check `workspace/log.txt` inside the job folder for more details.

#### No default VPC

If you receive a VPC error such as `VPCIdNotSpecified`, it means that no default network configuration ([Default VPC](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html)) has been created by your AWS administrator. Default VPCs are often used in smaller test environments. You can create a default VPC with this command: `aws ec2 create-default-vpc --region us-west-2`. If that fails, you may not have permission to create one and will have to reach out to your AWS administrator for a solution. In NVFlare versions >= 2.4.1 you are given the option to pick your own `--vpc-id` and `--subnet-id`.
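
The check and the fix described above could be scripted roughly as follows. This is only a sketch: the helper names are made up for illustration, and it assumes the AWS CLI is installed and that the error text contains the literal code `VPCIdNotSpecified`.

```python
import subprocess


def is_missing_default_vpc(error_text):
    """Return True if the error text mentions the VPCIdNotSpecified code."""
    return "VPCIdNotSpecified" in error_text


def create_default_vpc_command(region):
    """Build the AWS CLI call from the paragraph above for a given region."""
    return ["aws", "ec2", "create-default-vpc", "--region", region]


def create_default_vpc(region):
    """Run the command; False may mean missing permissions or no AWS CLI."""
    try:
        result = subprocess.run(
            create_default_vpc_command(region), capture_output=True, text=True
        )
    except FileNotFoundError:
        # The aws executable was not found on PATH.
        return False
    return result.returncode == 0
```

If `create_default_vpc("us-west-2")` returns `False`, the next step is the same as in the text: ask your AWS administrator, or pass your own VPC and subnet IDs.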

#### SSL issue

You may get this SSL error in log.txt with some versions of Python and Red Hat linux
Expand Down Expand Up @@ -410,11 +415,10 @@ then you add the packages you need in the client to `startup/requirements.txt` :
echo -e "torch \ntorchvision \ntensorboard" >> startup/requirements.txt
```

now you have the option of using an improved patched version of the AWS installer which allows you to skip many of the [additional configuration steps](#additional-configuration-steps) below. To use the patched version run these commands:
Now you have the option of using an improved, patched version of the AWS installer, which allows you to skip many of the [additional configuration steps](#additional-configuration-steps) below. To use the patched version, run this command to download and replace the existing `aws_start.sh` script:

```bash
wget https://raw.githubusercontent.com/dirkpetersen/nvflare-cancer/main/aws_start.sh.patch -O aws_start.sh.patch
patch startup/aws_start.sh < aws_start.sh.patch
wget https://raw.githubusercontent.com/dirkpetersen/nvflare-cancer/main/aws_start.sh -O startup/aws_start.sh
```

After this, run the `startup/start.sh` script or follow [these instructions to install the client on AWS](https://nvflare.readthedocs.io/en/main/real_world_fl/cloud_deployment.html#deploy-fl-client-on-aws):
Expand All @@ -423,6 +427,8 @@ After this, run the `startup/start.sh` script or follow [these instructions to i
startup/start.sh --cloud aws # you can get more automation by using: --config my_config.txt
```

**Note**: If you receive a VPC error such as `VPCIdNotSpecified`, you may be able to mitigate the issue with this command: `aws ec2 create-default-vpc --region us-west-2`. You can find more details in the troubleshooting section under [No default VPC](#no-default-vpc).

**Below we assume you use the patched version**

Now you need to confirm or change a few default settings. After confirming your AWS region, you can edit the AMI image name (which supports the wildcard `*`); it is used to search AWS for an AMI image ID in your specific AWS region. Our default is Ubuntu 22.04 because it has the latest supported Python version (3.10). You can also change amd64 to arm64, as ARM-based instances are sometimes cheaper.
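
To get a feel for how the wildcard name selects images, here is a small local approximation using Python's `fnmatch`. It is a sketch, not the AWS filter itself: the image names are hypothetical, and the pattern is wrapped in `*` on both sides because the script queries `Values=*${AMI_NAME}*`.

```python
from fnmatch import fnmatch


def switch_arch(ami_name, arch):
    """Swap the architecture token in an AMI name pattern (amd64 <-> arm64)."""
    other = "arm64" if arch == "amd64" else "amd64"
    return ami_name.replace(other, arch)


def matching_images(pattern, image_names):
    """Keep the names that match the pattern wrapped in '*' on both sides."""
    return [name for name in image_names if fnmatch(name, f"*{pattern}*")]


# Hypothetical image names for illustration only, not real AWS query output.
names = [
    "ubuntu-pro-server/images/ubuntu-jammy-22.04-amd64-pro-server-20240101",
    "ubuntu-pro-server/images/ubuntu-jammy-22.04-arm64-pro-server-20240101",
    "ubuntu-pro-server/images/ubuntu-focal-20.04-arm64-pro-server-20240101",
]
pattern = switch_arch("ubuntu-*-22.04-amd64-pro-server", "arm64")
print(matching_images(pattern, names))
```

Only the 22.04 arm64 name survives the filter, which mirrors what changing amd64 to arm64 at the prompt is meant to achieve.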
Expand All @@ -434,8 +440,8 @@ Note: run this command first for a different AWS profile:
* Cloud EC2 region, press ENTER to accept default: us-west-2
* Cloud AMI image name, press ENTER to accept default (use amd64 or arm64): ubuntu-*-22.04-arm64-pro-server
retrieving AMI ID for ubuntu-*-22.04-arm64-pro-server...
finding smallest instance type with 1 GPUs and 15360 MiB VRAM ... g5g.xlarge
retrieving AMI ID for ubuntu-*-22.04-arm64-pro-server ... ami-0d0b0cfbf4ce38093 found
finding smallest instance type with 1 GPUs and 15360 MiB VRAM ... g5g.xlarge found
* Cloud EC2 type, press ENTER to accept default: g5g.xlarge
* Cloud AMI image id, press ENTER to accept default: ami-0d0b0cfbf4ce38093
region = us-west-2, EC2 type = g5g.xlarge, ami image = ami-0d0b0cfbf4ce38093 , OK? (Y/n)
Expand All @@ -457,26 +463,26 @@ Installing os packages with apt in nvflare_client, this may take a few minutes .
Installing user space packages in nvflare_client, this may take a few minutes ...
System was provisioned
To terminate the EC2 instance, run the following command:
aws ec2 terminate-instances --instance-ids i-0dbbd2fb9a37c6783
aws ec2 terminate-instances --region us-west-2 --instance-ids i-0dbbd2fb9a37c6783
Other resources provisioned
security group: nvflare_client_sg_5254
key pair: NVFlareClientKeyPair
review install progress:
tail -f /tmp/nvflare.log
tail -f /tmp/nvflare-aws-YGR.log
login to instance:
ssh -i /home/dp/NVFlare/NVFlareClientKeyPair_i-0dbbd2fb9a37c6783.pem [email protected]
```
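
If you capture the installer output, the region and instance ID needed for cleanup can be pulled out of the terminate command it prints. The following is a minimal sketch under the assumption that the output keeps the format shown above; the function name is made up for illustration.

```python
import re

# Matches the terminate command with or without the --region option.
TERMINATE_RE = re.compile(
    r"aws ec2 terminate-instances\s+"
    r"(?:--region\s+(?P<region>\S+)\s+)?"
    r"--instance-ids\s+(?P<instance>i-[0-9a-f]+)"
)


def parse_terminate_command(output):
    """Return (region, instance_id), or None if no terminate command is found.

    region is None for the older output format without --region.
    """
    match = TERMINATE_RE.search(output)
    if match is None:
        return None
    return match.group("region"), match.group("instance")


sample = """To terminate the EC2 instance, run the following command:
aws ec2 terminate-instances --region us-west-2 --instance-ids i-0dbbd2fb9a37c6783
"""
print(parse_terminate_command(sample))  # -> ('us-west-2', 'i-0dbbd2fb9a37c6783')
```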

Now try logging in :

```bash
ssh -i /home/dp/NVFlare/NVFlareClientKeyPair.pem [email protected]
ssh -i /home/dp/NVFlare/NVFlareClientKeyPair_i-0dbbd2fb9a37c6783.pem [email protected]
```

or wait until the install has finished, you can check progress in /tmp/nvflare.log on your machine:
Or wait until the install has finished; you can check progress in `/tmp/nvflare-aws-YGR.log` on your machine:

```bash
tail -f /tmp/nvflare.log
tail -f /tmp/nvflare-aws-YGR.log
```

#### additional configuration steps
Expand Down Expand Up @@ -695,7 +701,7 @@ The NVFlare dashboard will be created in an isolated AWS account. Please see the
nvflare dashboard --cloud aws
```

If you receive a VPC error such as (`VPCIdNotSpecified`) it means that no default network configuration ([Default VPC](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html)) has been created by your AWS administrator. Default VPCs are often used in smaller test environments. You can create a default VPC by using this command: `aws ec2 create-default-vpc` . If that fails you may not have permission to create this and have to reach out to your AWS Administrator for a solution. In NVFlare versions > 2.4 you will also be able to pick your own VPC.
**Note**: If you receive a VPC error such as `VPCIdNotSpecified`, you may be able to mitigate the issue with this command: `aws ec2 create-default-vpc --region us-west-2`. You can find more details in the troubleshooting section under [No default VPC](#no-default-vpc).

After the dashboard has started, you will see a dashboard URL that includes an IP address and looks like `http://xxx.xxx.xxx.xxx:443`. Make sure you record the email address and the 5-digit initial password displayed in the terminal. Verify that you can log in at that URL with the email address as the user and that password. You can change your password at `MY INFO -> Edit My Profile`.

Expand Down Expand Up @@ -950,3 +956,55 @@ sudo reboot
## Installing Client

Please see [Using NVFlare as an Org Admin](#using-nvflare-as-an-org-admin).

# Contributing code to NVFlare

If you would like to make a code contribution to NVFlare, please check the [contributor docs](https://nvflare.readthedocs.io/en/main/contributing.html) first.
In our case, we have made modifications to the cloud deployment scripts and contributed some changes back. Please take these steps after [creating a fork](https://github.com/NVIDIA/NVFlare/fork) of NVFlare:

```
git clone git@github.com:your-organization/NVFlare.git
git clone git@github.com:dirkpetersen/nvflare-cancer.git
cd NVFlare
```

Check the folder `nvflare/lighter/impl` and make modifications to `aws_template.yml` and/or `master_template.yml`. Then generate a new `aws_start.sh` script for an NVFlare client in the `startup` folder of one of your client starter kits:

```
../nvflare-cancer/make-aws-client-script.py /starter-kit-folder/startup/aws_start.sh
```

Test this `aws_start.sh` script thoroughly, then run `runtest.sh`, commit the code to your forked NVFlare repository, and create a pull request on GitHub. The `make-aws-client-script.py` script uses NVFlare's internal machinery to generate shell scripts from YAML files:

```python
#! /usr/bin/env python3

import os, sys
from nvflare.lighter import tplt_utils, utils

client = "AWS-T4"
org = "Test"

lighter_folder = os.path.dirname(utils.__file__)
template = utils.load_yaml(os.path.join(lighter_folder, "impl", "master_template.yml"))
template.update(utils.load_yaml(os.path.join(lighter_folder, "impl", "aws_template.yml")))
tplt = tplt_utils.Template(template)
csp = 'aws'
if len(sys.argv) > 1:
    dest = sys.argv[1]
else:
    dest = os.path.join(os.getcwd(), f"{csp}_start.sh")
script = template[f"cloud_script_header"] + template[f"{csp}_start_sh"]
script = utils.sh_replace(
    script, {"type": "client", "inbound_rule": "", "cln_uid": f"uid={client}", "ORG": org}
)
utils._write(
    dest,
    script,
    "t",
    exe=True,
)
print(f"Script written to {dest} !")
```
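
To illustrate the idea behind the `utils.sh_replace` step without installing NVFlare, here is a standalone sketch of placeholder substitution. It assumes that placeholders are written as `{~~key~~}`, which is an assumption about the template format; check `nvflare/lighter/utils.py` and the YAML templates for the actual behavior. The template text and the function name below are made up for illustration.

```python
def fill_placeholders(template, values):
    """Replace every {~~key~~} marker (assumed syntax) with str(value)."""
    result = template
    for key, value in values.items():
        result = result.replace("{~~" + key + "~~}", str(value))
    return result


# Hypothetical two-part template standing in for the YAML sections.
sections = {
    "cloud_script_header": "#!/usr/bin/env bash\n",
    "aws_start_sh": 'VM_NAME=nvflare_{~~type~~}\necho "{~~cln_uid~~} org={~~ORG~~}"\n',
}
script = sections["cloud_script_header"] + sections["aws_start_sh"]
print(fill_placeholders(script, {"type": "client", "cln_uid": "uid=AWS-T4", "ORG": "Test"}))
```

Keys that have no matching marker in the template are simply ignored, and markers without a matching key stay in the output unchanged.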


23 changes: 12 additions & 11 deletions aws_start.sh
Expand Up @@ -28,7 +28,7 @@ function prompt() {
fi
}

get_resources_file() {
function get_resources_file() {
local rfile="${DIR}/../local/resources.json"
if [ -f "${rfile}" ]; then
echo "${rfile}"
Expand All @@ -40,7 +40,7 @@ get_resources_file() {
fi
}

find_ec2_gpu_instance_type() {
function find_ec2_gpu_instance_type() {
local gpucnt=0
local gpumem=0
if rfile=$(get_resources_file); then
Expand Down Expand Up @@ -89,19 +89,20 @@ do
esac
shift
done
TMPDIR="${TMPDIR:-/tmp}"
LOGFILE=$(mktemp "${TMPDIR}/nvflare-aws-XXX")
VM_NAME=nvflare_client
SECURITY_GROUP=nvflare_client_sg_$RANDOM
KEY_PAIR=NVFlareClientKeyPair
KEY_FILE=$(pwd)/${KEY_PAIR}.pem
IMAGE_OWNER="099720109477" # Owner account id=Amazon
ARCH=x86_64
AMI_IMAGE_OWNER="099720109477" # Owner account id=Amazon
AMI_NAME="ubuntu-*-22.04-amd64-pro-server"
AMI_ARCH=x86_64
EC2_TYPE_ARM=t4g.small

AMI_IMAGE=ami-01ed44191042f130f # 22.04 20.04:ami-063da375c17d500ab 24.04:ami-0833a2b4abf788b34 (us-west-2 only)
EC2_TYPE=t2.small
EC2_TYPE_ARM=t4g.small
NVIDIA_OS_PKG="nvidia-driver-550-server"
TMPDIR="${TMPDIR:-/tmp}"
LOGFILE=$(mktemp "${TMPDIR}/nvflare-aws-XXX")


echo "This script requires aws (AWS CLI), sshpass, dig and jq. Now checking if they are installed."

Expand Down Expand Up @@ -156,7 +157,7 @@ if [ $useDefault = true ]; then
if [ ${container} = false ]; then
read -e -i ${AMI_NAME} -p "* Cloud AMI image name, press ENTER to accept default (use amd64 or arm64): " AMI_NAME
printf " retrieving AMI ID for ${AMI_NAME} ... "
IMAGES=$(aws ec2 describe-images --region ${REGION} --owners ${IMAGE_OWNER} --filters "Name=name,Values=*${AMI_NAME}*" --output json)
IMAGES=$(aws ec2 describe-images --region ${REGION} --owners ${AMI_IMAGE_OWNER} --filters "Name=name,Values=*${AMI_NAME}*" --output json)
if [ "${#IMAGES}" -lt 30 ]; then
echo -e "\nNo images found, starting over\n"
continue
Expand Down Expand Up @@ -278,7 +279,7 @@ if [ $container = true ]; then
report_status "$?" "launching container"
else
# Spawn a process to install os packages as root
echo "Installing os packages as root in the background, this may take a few minutes ... "
echo "Installing os packages as root in $VM_NAME, may take a few minutes ... "
ssh -f -i ${KEY_FILE2} -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ${DEST_SITE} \
' NVIDIA_OS_PKG="nvidia-driver-550-server" && sudo apt update && \
sudo DEBIAN_FRONTEND=noninteractive apt install -y python3-dev gcc && \
Expand All @@ -289,7 +290,7 @@ else
report_status "$?" "installing os packages"
sleep 10
# Spawn a process to install packages as user
echo "Installing user space packages in the background, this may take a few minutes ... "
echo "Installing user space packages in $VM_NAME, may take a few minutes ... "
ssh -f -i ${KEY_FILE2} -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ${DEST_SITE} \
' echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc && \
export PATH=/home/ubuntu/.local/bin:$PATH && \
Expand Down
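
The `TMPDIR`/`LOGFILE` lines in this script give every run its own log file instead of the fixed `/tmp/nvflare.log`. The same idea in Python, as a rough sketch: `make_log_file` is a hypothetical helper, and `tempfile.mkstemp` appends a longer random suffix than the three characters from `mktemp ... XXX`.

```python
import os
import tempfile


def make_log_file(prefix="nvflare-aws-"):
    """Create a uniquely named, empty log file and return its path."""
    # Use $TMPDIR when it is set and non-empty, else the system default.
    tmpdir = os.environ.get("TMPDIR") or tempfile.gettempdir()
    fd, path = tempfile.mkstemp(prefix=prefix, dir=tmpdir)
    os.close(fd)  # only the path is needed; writers reopen the file later
    return path


log_path = make_log_file()
print(f"review install progress: tail -f {log_path}")
```

Because the name is unique per run, two installers started on the same machine do not overwrite each other's logs.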
55 changes: 0 additions & 55 deletions aws_start.sh.2.4.0.patch

This file was deleted.

28 changes: 28 additions & 0 deletions make-aws-client-script.py
@@ -0,0 +1,28 @@
#! /usr/bin/env python3

import os, sys
from nvflare.lighter import tplt_utils, utils

client = "AWS-T4"
org = "Test"

lighter_folder = os.path.dirname(utils.__file__)
template = utils.load_yaml(os.path.join(lighter_folder, "impl", "master_template.yml"))
template.update(utils.load_yaml(os.path.join(lighter_folder, "impl", "aws_template.yml")))
tplt = tplt_utils.Template(template)
csp = 'aws'
if len(sys.argv) > 1:
    dest = sys.argv[1]
else:
    dest = os.path.join(os.getcwd(), f"{csp}_start.sh")
script = template[f"cloud_script_header"] + template[f"{csp}_start_sh"]
script = utils.sh_replace(
    script, {"type": "client", "inbound_rule": "", "cln_uid": f"uid={client}", "ORG": org}
)
utils._write(
    dest,
    script,
    "t",
    exe=True,
)
print(f"Script written to {dest} !")
