added how to contrib code to NVFlare

biodev · Jun 16, 2024 · 53804e4
1 parent f639259

Showing 4 changed files with 110 additions and 78 deletions.
diff --git a/README.md b/README.md
@@ -26,6 +26,7 @@ The central NVFlare dashboard and server was installed by the `Project Admin`, t
       - [Pytorch poc mode](#pytorch-poc-mode)
     - [Troubleshooting](#troubleshooting)
       - [missing server dependencies](#missing-server-dependencies)
+      - [No default VPC](#no-default-vpc)
       - [SSL issue](#ssl-issue)
   - [Using NVFlare as an Org Admin](#using-nvflare-as-an-org-admin)
     - [Register client sites](#register-client-sites)
@@ -48,7 +49,7 @@ The central NVFlare dashboard and server was installed by the `Project Admin`, t
     - [Resetting Dashboard](#resetting-dashboard)
   - [Installing Server](#installing-server)
   - [Installing Client](#installing-client)
-
+- [Contributing code to NVFlare](#contributing-code-to-nvflare)
 
 # Installing NVFlare deploy environment 
 
@@ -346,6 +347,10 @@ options:
 
 If you get an `Error 113` in the server log, a dependency may be missing on the server. For example, the NVFlare hello-pt example requires PyTorch not only on the clients but also on the server. To confirm the root cause, log in with the FLARE console (admin CLI) and run `download_job [job-id]` to get the entire workspace folder, which you will find in the console's transfer folder. Check `workspace/log.txt` inside the job folder for more details.
 
+#### No default VPC
+
+If you receive a VPC error such as `VPCIdNotSpecified`, your AWS administrator has not created a default network configuration ([Default VPC](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html)) in that region. Default VPCs are often used in smaller test environments. You can create a default VPC with this command (replace `us-west-2` with your region): `aws ec2 create-default-vpc --region us-west-2`. If that fails, you may not have permission to create one and will need to ask your AWS administrator for a solution. In NVFlare versions >= 2.4.1 you can also pick your own VPC and subnet with the `--vpc-id` and `--subnet-id` options.
+
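To check for a default VPC from Python instead of the AWS CLI, here is a minimal sketch. It assumes the `boto3` EC2 client: `describe_vpcs` with the `isDefault` filter, and `create_default_vpc`. The client is passed in as a parameter, so the logic can be tried without AWS credentials. The function names are hypothetical and not part of NVFlare.

```python
from typing import Any, Optional


def find_default_vpc(ec2: Any) -> Optional[str]:
    """Return the region's default VPC ID, or None if it has no default VPC.

    `ec2` is assumed to be a boto3 EC2 client, for example
    boto3.client("ec2", region_name="us-west-2").
    """
    resp = ec2.describe_vpcs(Filters=[{"Name": "isDefault", "Values": ["true"]}])
    vpcs = resp.get("Vpcs", [])
    return vpcs[0]["VpcId"] if vpcs else None


def ensure_default_vpc(ec2: Any) -> str:
    """Return the default VPC ID, creating the default VPC first if needed.

    create_default_vpc() raises an error if your IAM identity lacks
    permission (ec2:CreateDefaultVpc); then ask your AWS administrator.
    """
    vpc_id = find_default_vpc(ec2)
    if vpc_id is None:
        vpc_id = ec2.create_default_vpc()["Vpc"]["VpcId"]
    return vpc_id
```

For example: `ensure_default_vpc(boto3.client("ec2", region_name="us-west-2"))`. If creation is not permitted, boto3 raises a `botocore.exceptions.ClientError`. At that point you need your AWS administrator.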
 #### SSL issue 
 
 You may get this SSL error in log.txt with some versions of Python and Red Hat linux
@@ -410,11 +415,10 @@ then you add the packages you need in the client to `startup/requirements.txt` :
 echo -e "torch \ntorchvision \ntensorboard" >> startup/requirements.txt
 ```
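If you run the `echo ... >>` command above twice, the packages are appended twice. Here is a minimal idempotent alternative in Python; the helper name `add_requirements` is hypothetical. It appends only entries that are not already in the file. It compares whole stripped lines, so it tolerates the trailing spaces that `echo -e "torch \n..."` writes, but it does not interpret version specifiers.

```python
from pathlib import Path


def add_requirements(path: str, packages: list[str]) -> list[str]:
    """Append packages to a requirements file unless already listed.

    Comparison is on whole, stripped lines, so "torch" and "torch>=2.0" are
    treated as different entries. Returns the names actually appended.
    """
    req = Path(path)
    text = req.read_text() if req.exists() else ""
    existing = {line.strip() for line in text.splitlines() if line.strip()}
    missing = []
    for pkg in packages:
        name = pkg.strip()
        if name and name not in existing:
            missing.append(name)
            existing.add(name)
    if missing:
        if text and not text.endswith("\n"):
            text += "\n"  # keep the last existing entry on its own line
        req.write_text(text + "".join(f"{name}\n" for name in missing))
    return missing
```

Usage: `add_requirements("startup/requirements.txt", ["torch", "torchvision", "tensorboard"])`.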
 
-now you have the option of using an improved patched version of the AWS installer which allows you to skip many of the [additional configuration steps](#additional-configuration-steps) below. To use the patched version run these commands:
+Now you have the option of using an improved, patched version of the AWS installer, which lets you skip many of the [additional configuration steps](#additional-configuration-steps) below. To use it, run this command to download the patched script and replace the existing `startup/aws_start.sh`:
 
 ```bash
-wget https://raw.githubusercontent.com/dirkpetersen/nvflare-cancer/main/aws_start.sh.patch -O aws_start.sh.patch
-patch startup/aws_start.sh < aws_start.sh.patch
+wget https://raw.githubusercontent.com/dirkpetersen/nvflare-cancer/main/aws_start.sh -O startup/aws_start.sh
 ```
 
 After this, run the `startup/start.sh` script or follow [these instructions to install the client on AWS](https://nvflare.readthedocs.io/en/main/real_world_fl/cloud_deployment.html#deploy-fl-client-on-aws): 
@@ -423,6 +427,8 @@ After this, run the `startup/start.sh` script or follow [these instructions to i
 startup/start.sh --cloud aws     # you can get more automation by using: --config my_config.txt
 ```
 
+**Note**: If you receive a VPC error such as `VPCIdNotSpecified`, you may be able to fix it with this command: `aws ec2 create-default-vpc --region us-west-2`. See [No default VPC](#no-default-vpc) in the troubleshooting section for details.
+
 **Below we assume you use the patched version**
 
 Now you need to confirm or change a few default settings. After confirming your AWS region, you can edit the AMI image name (which supports `*` wildcards) that is used to search AWS for an AMI image ID in your AWS region. Our default is Ubuntu 22.04 because it has the latest supported Python version (3.10). You can also change amd64 to arm64, as ARM-based instances are sometimes cheaper.
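You can also reproduce the AMI lookup by hand. Suppose you saved the output of `aws ec2 describe-images --region us-west-2 --owners 099720109477 --filters "Name=name,Values=*ubuntu-*-22.04-arm64-pro-server*" --output json` to a file. The sketch below then picks the most recently created matching image. Choosing the newest image is our assumption of a sensible rule; we have not verified that `aws_start.sh` picks the same one. The helper name is hypothetical.

```python
import json
from typing import Optional


def newest_image_id(describe_images_json: str) -> Optional[str]:
    """Return the ImageId of the most recently created image, or None.

    describe_images_json is the text printed by `aws ec2 describe-images
    ... --output json`. CreationDate is an ISO 8601 UTC timestamp, so plain
    string comparison orders the images chronologically.
    """
    images = json.loads(describe_images_json).get("Images", [])
    if not images:
        return None
    return max(images, key=lambda img: img["CreationDate"])["ImageId"]
```

Usage: `newest_image_id(open("images.json").read())`.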
@@ -434,8 +440,8 @@ Note: run this command first for a different AWS profile:
 
 * Cloud EC2 region, press ENTER to accept default: us-west-2
 * Cloud AMI image name, press ENTER to accept default (use amd64 or arm64): ubuntu-*-22.04-arm64-pro-server
-    retrieving AMI ID for ubuntu-*-22.04-arm64-pro-server...
-    finding smallest instance type with 1 GPUs and 15360 MiB VRAM ... g5g.xlarge
+    retrieving AMI ID for ubuntu-*-22.04-arm64-pro-server ... ami-0d0b0cfbf4ce38093 found
+    finding smallest instance type with 1 GPUs and 15360 MiB VRAM ... g5g.xlarge found
 * Cloud EC2 type, press ENTER to accept default: g5g.xlarge
 * Cloud AMI image id, press ENTER to accept default: ami-0d0b0cfbf4ce38093
 region = us-west-2, EC2 type = g5g.xlarge, ami image = ami-0d0b0cfbf4ce38093 , OK? (Y/n)
@@ -457,26 +463,26 @@ Installing os packages with apt in nvflare_client, this may take a few minutes .
 Installing user space packages in nvflare_client, this may take a few minutes ...
 System was provisioned
 To terminate the EC2 instance, run the following command:
-  aws ec2 terminate-instances --instance-ids i-0dbbd2fb9a37c6783
+  aws ec2 terminate-instances --region us-west-2 --instance-ids i-0dbbd2fb9a37c6783
 Other resources provisioned
 security group: nvflare_client_sg_5254
 key pair: NVFlareClientKeyPair
 review install progress:
-  tail -f /tmp/nvflare.log
+  tail -f /tmp/nvflare-aws-YGR.log
 login to instance:
   ssh -i /home/dp/NVFlare/NVFlareClientKeyPair_i-0dbbd2fb9a37c6783.pem [email protected]
 ```
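Terminating the instance leaves the security group and key pair in place; these are listed under "Other resources provisioned". The hypothetical helper below turns that summary into the AWS CLI commands that would remove all three. It only builds strings and does not call AWS. Note that `--group-name` is accepted only for security groups in the default VPC. If you launched into your own VPC, look up the group ID and use `--group-id` instead.

```python
import re


def cleanup_commands(provision_output: str, region: str) -> list[str]:
    """Build AWS CLI commands that remove the resources aws_start.sh reported.

    Parses the instance ID, "security group:" and "key pair:" lines of the
    summary. The commands are ordered: the security group can only be deleted
    after the instance using it has terminated.
    """
    cmds = []
    inst = re.search(r"--instance-ids\s+(i-[0-9a-f]+)", provision_output)
    sg = re.search(r"security group:\s*(\S+)", provision_output)
    kp = re.search(r"key pair:\s*(\S+)", provision_output)
    if inst:
        iid = inst.group(1)
        cmds.append(f"aws ec2 terminate-instances --region {region} --instance-ids {iid}")
        # block until the instance is gone so its security group is free
        cmds.append(f"aws ec2 wait instance-terminated --region {region} --instance-ids {iid}")
    if sg:
        # --group-name only works for default-VPC groups; otherwise use --group-id
        cmds.append(f"aws ec2 delete-security-group --region {region} --group-name {sg.group(1)}")
    if kp:
        cmds.append(f"aws ec2 delete-key-pair --region {region} --key-name {kp.group(1)}")
    return cmds
```

Deleting the key pair in AWS does not delete the local `.pem` file.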
 
 Now try logging in:
 
 ```bash
-ssh -i /home/dp/NVFlare/NVFlareClientKeyPair.pem [email protected]
+ssh -i /home/dp/NVFlare/NVFlareClientKeyPair_i-0dbbd2fb9a37c6783.pem [email protected]
 ```
 
-or wait until the install has finished, you can check progress in /tmp/nvflare.log on your machine:
+Or wait until the installation has finished. You can check progress on your machine in `/tmp/nvflare-aws-YGR.log` (the suffix after `nvflare-aws-` is random, so use the log name printed by the script):
 
 ```bash
-tail -f /tmp/nvflare.log
+tail -f /tmp/nvflare-aws-YGR.log
 ```
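The script creates the log file with `mktemp "${TMPDIR}/nvflare-aws-XXX"`, so its suffix changes on every run. If you lose track of the name, this sketch returns the most recently modified `nvflare-aws-*` file, using `$TMPDIR` or `/tmp` as the script does. The helper is hypothetical and not part of NVFlare.

```python
import glob
import os
from typing import Optional


def latest_nvflare_log(tmpdir: Optional[str] = None) -> Optional[str]:
    """Return the most recently modified nvflare-aws-* file, or None.

    Mirrors `mktemp "${TMPDIR}/nvflare-aws-XXX"` in aws_start.sh: the
    directory defaults to $TMPDIR, or /tmp when TMPDIR is unset or empty.
    """
    directory = tmpdir or os.environ.get("TMPDIR") or "/tmp"
    logs = glob.glob(os.path.join(directory, "nvflare-aws-*"))
    return max(logs, key=os.path.getmtime) if logs else None
```

Usage: `print(latest_nvflare_log())`, then pass that path to `tail -f`.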
 
 #### additional configuration steps
@@ -695,7 +701,7 @@ The NVFlare dashboard will be created in an isolated AWS account. Please see the
 nvflare dashboard --cloud aws
 ```
 
-If you receive a VPC error such as (`VPCIdNotSpecified`) it means that no default network configuration ([Default VPC](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html)) has been created by your AWS administrator. Default VPCs are often used in smaller test envionments. You can create a default VPC by using this command: `aws ec2 create-default-vpc` . If that fails you may not have permission to create this and have to reach out to your AWS Administrator for a solution. In NVFlare versions > 2.4 you will also be able to pick your own VPC. 
+**Note**: If you receive a VPC error such as `VPCIdNotSpecified`, you may be able to fix it with this command: `aws ec2 create-default-vpc --region us-west-2`. See [No default VPC](#no-default-vpc) in the troubleshooting section for details.
 
 After the dashboard starts, you will see a dashboard URL that includes an IP address and looks like `http://xxx.xxx.xxx.xxx:443`. Make sure you record the email address and the 5-digit initial password displayed in the terminal. Verify that you can log in at that URL with the email address as the user name and that password. You can change your password at `MY INFO -> Edit My Profile`.
 
@@ -950,3 +956,55 @@ sudo reboot
 ## Installing Client
 
 please see [Using NVFlare as an Org Admin](#using-nvflare-as-an-org-admin)
+
+# Contributing code to NVFlare 
+
+If you would like to make a code contribution to NVFlare, please check the [contributor docs](https://nvflare.readthedocs.io/en/main/contributing.html) first. 
+In our case, we modified the cloud deployment scripts and contributed some of the changes back. After [creating a fork](https://github.com/NVIDIA/NVFlare/fork) of NVFlare, take these steps:
+
+```bash
+git clone git@github.com:your-organization/NVFlare.git
+git clone git@github.com:dirkpetersen/nvflare-cancer.git
+cd NVFlare
+```
+
+Check the folder `nvflare/lighter/impl` and modify `aws_template.yml` and/or `master_template.yml`. Then generate a new `aws_start.sh` script in the `startup` folder of one of your client starter kits:
+
+```bash
+../nvflare-cancer/make-aws-client-script.py /starter-kit-folder/startup/aws_start.sh
+```
+
+Test this `aws_start.sh` script thoroughly. Then run `runtest.sh`, commit the code to your forked NVFlare repository, and create a pull request on GitHub. The `make-aws-client-script.py` script uses NVFlare's internal machinery to generate shell scripts from YAML files:
+
+```python
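+#! /usr/bin/env python3
+"""Generate an aws_start.sh client script from NVFlare's lighter YAML templates."""
+
+import os
+import sys
+
+from nvflare.lighter import utils
+
+client = "AWS-T4"  # client site name written into the script
+org = "Test"       # organization name written into the script
+csp = "aws"        # cloud service provider
+
+# Load the master template, then overlay the AWS-specific template keys
+lighter_folder = os.path.dirname(utils.__file__)
+template = utils.load_yaml(os.path.join(lighter_folder, "impl", "master_template.yml"))
+template.update(utils.load_yaml(os.path.join(lighter_folder, "impl", "aws_template.yml")))
+
+# Destination: first command-line argument, else ./aws_start.sh
+if len(sys.argv) > 1:
+    dest = sys.argv[1]
+else:
+    dest = os.path.join(os.getcwd(), f"{csp}_start.sh")
+
+# Prepend the common cloud header and fill in the client-specific placeholders
+script = template["cloud_script_header"] + template[f"{csp}_start_sh"]
+script = utils.sh_replace(
+    script, {"type": "client", "inbound_rule": "", "cln_uid": f"uid={client}", "ORG": org}
+)
+
+# Write in text mode ("t") and mark the file executable
+utils._write(dest, script, "t", exe=True)
+print(f"Script written to {dest} !")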
+```
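Before testing on AWS, two cheap checks can catch template mistakes early. The first confirms that no placeholder was left unsubstituted; the second confirms that `bash -n` (parse only, no execution) accepts the script. This sketch assumes placeholders take the `{~~name~~}` form that `utils.sh_replace` replaces; verify this against the templates in your NVFlare version. Both helper names are hypothetical.

```python
import re
import subprocess

# Assumption: NVFlare templates mark substitution points as {~~name~~},
# the form that nvflare.lighter.utils.sh_replace() fills in.
PLACEHOLDER = re.compile(r"\{~~(\w+)~~\}")


def leftover_placeholders(script_text: str) -> list[str]:
    """Return the sorted, de-duplicated names of placeholders still present."""
    return sorted(set(PLACEHOLDER.findall(script_text)))


def bash_syntax_ok(path: str) -> bool:
    """Parse the script with `bash -n` (no commands are executed)."""
    result = subprocess.run(["bash", "-n", path], capture_output=True, text=True)
    if result.returncode != 0:
        print(result.stderr)  # bash reports the offending line numbers here
    return result.returncode == 0
```

For example, `leftover_placeholders(open(dest).read())` should return `[]`, and `bash_syntax_ok(dest)` should return `True`. Neither check replaces running the script on AWS, because `bash -n` finds syntax errors only.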
+
+
diff --git a/aws_start.sh b/aws_start.sh
@@ -28,7 +28,7 @@ function prompt() {
   fi
 }
 
-get_resources_file() {
+function get_resources_file() {
   local rfile="${DIR}/../local/resources.json"  
   if [ -f "${rfile}" ]; then
     echo "${rfile}"
@@ -40,7 +40,7 @@ get_resources_file() {
   fi
 }
 
-find_ec2_gpu_instance_type() {
+function find_ec2_gpu_instance_type() {
   local gpucnt=0
   local gpumem=0
   if rfile=$(get_resources_file); then
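`find_ec2_gpu_instance_type` appears to read the required GPU count and memory from `local/resources.json`, then searches for a matching EC2 type. This produces output such as `finding smallest instance type with 1 GPUs and 15360 MiB VRAM ... g5g.xlarge found`. The selection rule below is a hypothetical sketch, not the script's code. Given the `InstanceTypes` list from `aws ec2 describe-instance-types --output json`, it keeps the types with enough GPUs and enough memory per GPU. It then prefers fewer vCPUs and, after that, less system memory.

```python
from typing import Any, Optional


def smallest_gpu_instance_type(
    instance_types: list[dict[str, Any]], gpu_count: int, gpu_mem_mib: int
) -> Optional[str]:
    """Return the 'smallest' instance type meeting the GPU request, or None.

    instance_types: the "InstanceTypes" list printed by
    `aws ec2 describe-instance-types --output json`.
    Assumed heuristic: fewest vCPUs first, then least system memory.
    """
    candidates = []
    for it in instance_types:
        gpus = it.get("GpuInfo", {}).get("Gpus", [])
        count = sum(g.get("Count", 0) for g in gpus)
        # every GPU model on the instance must have at least gpu_mem_mib of VRAM
        per_gpu_mib = min((g["MemoryInfo"]["SizeInMiB"] for g in gpus), default=0)
        if count >= gpu_count and per_gpu_mib >= gpu_mem_mib:
            candidates.append(it)
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda it: (it["VCpuInfo"]["DefaultVCpus"], it["MemoryInfo"]["SizeInMiB"]),
    )
    return best["InstanceType"]
```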
@@ -89,19 +89,20 @@ do
   esac
   shift
 done
-TMPDIR="${TMPDIR:-/tmp}"
-LOGFILE=$(mktemp "${TMPDIR}/nvflare-aws-XXX")
 VM_NAME=nvflare_client
 SECURITY_GROUP=nvflare_client_sg_$RANDOM
 KEY_PAIR=NVFlareClientKeyPair
 KEY_FILE=$(pwd)/${KEY_PAIR}.pem
-IMAGE_OWNER="099720109477" # Owner account id=Amazon
-ARCH=x86_64
+AMI_IMAGE_OWNER="099720109477" # Owner account id=Amazon
 AMI_NAME="ubuntu-*-22.04-amd64-pro-server"
+AMI_ARCH=x86_64
+EC2_TYPE_ARM=t4g.small
+
 AMI_IMAGE=ami-01ed44191042f130f  # 22.04  20.04:ami-063da375c17d500ab 24.04:ami-0833a2b4abf788b34  (us-west-2 only)
 EC2_TYPE=t2.small
-EC2_TYPE_ARM=t4g.small
-NVIDIA_OS_PKG="nvidia-driver-550-server"
+TMPDIR="${TMPDIR:-/tmp}"
+LOGFILE=$(mktemp "${TMPDIR}/nvflare-aws-XXX")
+
 
 echo "This script requires aws (AWS CLI), sshpass, dig and jq.  Now checking if they are installed."
 
@@ -156,7 +157,7 @@ if [ $useDefault = true ]; then
     if [ ${container} = false ]; then
       read -e -i ${AMI_NAME} -p "* Cloud AMI image name, press ENTER to accept default (use amd64 or arm64): " AMI_NAME
       printf "    retrieving AMI ID for ${AMI_NAME} ... " 
-      IMAGES=$(aws ec2 describe-images --region ${REGION} --owners ${IMAGE_OWNER} --filters "Name=name,Values=*${AMI_NAME}*" --output json)
+      IMAGES=$(aws ec2 describe-images --region ${REGION} --owners ${AMI_IMAGE_OWNER} --filters "Name=name,Values=*${AMI_NAME}*" --output json)
       if [ "${#IMAGES}" -lt 30 ]; then
         echo -e "\nNo images found, starting over\n"
         continue
@@ -278,7 +279,7 @@ if [ $container = true ]; then
   report_status "$?" "launching container"
 else
   # Spawn a process to install os packages as root
-  echo "Installing os packages as root in the background, this may take a few minutes ... "
+  echo "Installing os packages as root in $VM_NAME, may take a few minutes ... "
   ssh -f -i ${KEY_FILE2} -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ${DEST_SITE} \
   ' NVIDIA_OS_PKG="nvidia-driver-550-server" && sudo apt update && \
   sudo DEBIAN_FRONTEND=noninteractive apt install -y python3-dev gcc && \
@@ -289,7 +290,7 @@ else
   report_status "$?" "installing os packages"
   sleep 10
   # Spawn a process to install packages as user
-  echo "Installing user space packages in the background, this may take a few minutes ... "
+  echo "Installing user space packages in $VM_NAME, may take a few minutes ... "
   ssh -f -i ${KEY_FILE2} -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ${DEST_SITE} \
   ' echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc && \
   export PATH=/home/ubuntu/.local/bin:$PATH && \

diff --git a/aws_start.sh.2.4.0.patch b/aws_start.sh.2.4.0.patch
diff --git a/make-aws-client-script.py b/make-aws-client-script.py
@@ -0,0 +1,28 @@
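+#! /usr/bin/env python3
+"""Generate an aws_start.sh client script from NVFlare's lighter YAML templates."""
+
+import os
+import sys
+
+from nvflare.lighter import utils
+
+client = "AWS-T4"  # client site name written into the script
+org = "Test"       # organization name written into the script
+csp = "aws"        # cloud service provider
+
+# Load the master template, then overlay the AWS-specific template keys
+lighter_folder = os.path.dirname(utils.__file__)
+template = utils.load_yaml(os.path.join(lighter_folder, "impl", "master_template.yml"))
+template.update(utils.load_yaml(os.path.join(lighter_folder, "impl", "aws_template.yml")))
+
+# Destination: first command-line argument, else ./aws_start.sh
+if len(sys.argv) > 1:
+    dest = sys.argv[1]
+else:
+    dest = os.path.join(os.getcwd(), f"{csp}_start.sh")
+
+# Prepend the common cloud header and fill in the client-specific placeholders
+script = template["cloud_script_header"] + template[f"{csp}_start_sh"]
+script = utils.sh_replace(
+    script, {"type": "client", "inbound_rule": "", "cln_uid": f"uid={client}", "ORG": org}
+)
+
+# Write in text mode ("t") and mark the file executable
+utils._write(dest, script, "t", exe=True)
+print(f"Script written to {dest} !")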