
AWS for deep learning

John Pearson edited this page Nov 27, 2017 · 13 revisions

Setting up the key (first time only)

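  • Get the penaltykick.pem file from John.
  • Copy it to the ~/.ssh folder in your home directory.
  • Make it readable only by you: chmod 400 ~/.ssh/penaltykick.pem (ssh refuses to use a private key that other users can read).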
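The key steps above can be wrapped in a small shell function, sketched here under a few assumptions: the name install_pk_key is hypothetical, it takes the path where you saved the downloaded key plus an optional ssh directory (default ~/.ssh), and it also sets that directory to 700, which ssh expects.

```shell
# Hypothetical helper: copy the downloaded key into an ssh directory and
# make it readable only by you (ssh ignores keys other users can read).
install_pk_key() {
  local src="$1"
  local ssh_dir="${2:-$HOME/.ssh}"
  if [ ! -f "$src" ]; then
    echo "key not found: $src" >&2
    return 1
  fi
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  # -f replaces an existing read-only copy instead of failing.
  cp -f "$src" "$ssh_dir/penaltykick.pem"
  chmod 400 "$ssh_dir/penaltykick.pem"
}
```

For example, if the key was saved to your Downloads folder: install_pk_key ~/Downloads/penaltykick.pem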

Log in to the lab's Amazon (AWS) console with MFA, then navigate to EC2 > {X} Running Instances.

How to start up an instance on Amazon

  • Click on PK{X}.
  • Actions --> Instance State --> Start --> Yes, Start (the instance takes a few minutes to boot).
  • Copy the Public DNS (under the Description tab) to your clipboard. It changes every time the instance is stopped and started, so copy it again after each restart.
  • Open a terminal on your computer (from your home directory) and connect:
ssh -i ~/.ssh/penaltykick.pem ubuntu@{DNS address}

You're in!
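If you have the AWS CLI installed locally, the console steps can also be scripted. This is a sketch, not a tested lab tool: it assumes the CLI is already configured with working credentials for the lab account, and that PK{X} is the instance's Name tag (the value the console shows in its Name column). The function names pk_start and pk_ssh are hypothetical.

```shell
# Sketch: start PK{X} and ssh in without the console.
# Assumptions: AWS CLI configured for the lab account; Name tag is PK{X}.
PK_KEY="${PK_KEY:-$HOME/.ssh/penaltykick.pem}"

# Start the named instance, wait until it is running, and print its Public DNS.
pk_start() {
  local name="$1" id
  if [ -z "$name" ]; then
    echo "usage: pk_start <instance name, e.g. PK1>" >&2
    return 1
  fi
  id="$(aws ec2 describe-instances \
          --filters "Name=tag:Name,Values=$name" \
                    "Name=instance-state-name,Values=stopped,running" \
          --query 'Reservations[0].Instances[0].InstanceId' \
          --output text)" || return 1
  if [ -z "$id" ] || [ "$id" = "None" ]; then
    echo "no stopped or running instance tagged Name=$name" >&2
    return 1
  fi
  aws ec2 start-instances --instance-ids "$id" >/dev/null || return 1
  # Blocks until EC2 reports the instance state as "running".
  aws ec2 wait instance-running --instance-ids "$id" || return 1
  # The Public DNS is reassigned on each start, so look it up now.
  aws ec2 describe-instances --instance-ids "$id" \
    --query 'Reservations[0].Instances[0].PublicDnsName' --output text
}

# ssh into an instance by its Public DNS.
pk_ssh() {
  if [ -z "$1" ]; then
    echo "usage: pk_ssh <public DNS>" >&2
    return 1
  fi
  ssh -i "$PK_KEY" "ubuntu@$1"
}
```

Usage: pk_ssh "$(pk_start PK1)". The waiter returns once EC2 reports the instance as running, which can be slightly before sshd accepts connections, so the first ssh attempt may need a retry.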

Starting a new type of instance

  • Read Amazon's EC2 Instance Types page to see the instance families Amazon provides and what their letters mean (e.g., m = general purpose, c = compute optimized, r = memory optimized, p = GPU/accelerated computing).
  • For our experiments, use a p3.2xlarge for GPU experiments and a c5.2xlarge for CPU experiments.
  • Launch Instance --> AWS Marketplace.
  • Search for "Amazon Deep Learning", select "Deep Learning AMI (Ubuntu)", then click Continue.
  • Choose your instance type, then click Configure Instance Details.
  • Keep the defaults (or change them as needed) and add "penaltykick" as a tag, then go to Configure Security Group.
  • Choose to select an existing security group, then select the Deep Learning AMI ubuntu group.
  • Choose your existing key pair, then launch.
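The same launch can be sketched with the AWS CLI's run-instances command. The AMI ID, security group ID, and key pair name are not given on this page, so they are parameters you look up in the console (the AMI ID differs by region). Setting the tag as Name=penaltykick is an assumption about how the "penaltykick" tag is meant to be applied, and pk_launch is a hypothetical name.

```shell
# Sketch: launch one instance from the Deep Learning AMI via the AWS CLI.
# Assumptions: AMI ID, security group ID, and key pair name come from the
# console; the default key pair name "penaltykick" is hypothetical.
pk_launch() {
  local type="$1" ami_id="$2" sg_id="$3" key_name="${4:-penaltykick}"
  if [ -z "$type" ] || [ -z "$ami_id" ] || [ -z "$sg_id" ]; then
    echo "usage: pk_launch <instance type> <AMI ID> <security group ID> [key pair name]" >&2
    return 1
  fi
  # Prints the new instance ID on success.
  aws ec2 run-instances \
    --image-id "$ami_id" \
    --instance-type "$type" \
    --count 1 \
    --key-name "$key_name" \
    --security-group-ids "$sg_id" \
    --tag-specifications "ResourceType=instance,Tags=[{Key=Name,Value=penaltykick}]" \
    --query 'Instances[0].InstanceId' --output text
}
```

For example, pk_launch p3.2xlarge <AMI ID> <security group ID> for a GPU run. Marketplace AMIs may need to be subscribed to once through the console before the CLI can launch them.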

Mounting Data Drive

  • In the AWS console, go to Storage --> EFS.
  • Click on the penalty kick data file system, then copy its "DNS name".
  • Back in the terminal on the instance, run ls ~/data to check whether the directory exists; if it doesn't, create it with mkdir ~/data.
  • Run the following, replacing <DNS> with the DNS name you copied:
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 <DNS>:/ ~/data

This is usually enough. If ls -al in your home directory shows that data/ is not owned by ubuntu, you may also need to run

sudo chown ubuntu:ubuntu ~/data
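On the instance, the mount steps can be combined into one function that is safe to re-run. This sketch reuses the exact NFS options above; pk_mount is a hypothetical name, and it relies on mountpoint from util-linux, which Ubuntu ships by default.

```shell
# Hypothetical helper: create the mount point, mount the EFS share with the
# same NFS options as above, and fix ownership only when needed.
pk_mount() {
  local efs_dns="$1" target="${2:-$HOME/data}"
  if [ -z "$efs_dns" ]; then
    echo "usage: pk_mount <EFS DNS name> [mount point]" >&2
    return 1
  fi
  mkdir -p "$target"
  # mountpoint -q succeeds if something is already mounted there.
  if mountpoint -q "$target"; then
    echo "$target is already mounted"
    return 0
  fi
  sudo mount -t nfs4 \
    -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    "$efs_dns:/" "$target" || return 1
  # -O is true when the current user owns the directory.
  if [ ! -O "$target" ]; then
    sudo chown "$(id -un):$(id -gn)" "$target"
  fi
}
```

Usage: pk_mount <DNS>, with the EFS DNS name copied from the console.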

Cloning Code

You'll need to put on the instance any code that isn't part of one of the preinstalled frameworks. For GitHub code (e.g., ours), do the following (using ~/code as the example directory):

  • mkdir ~/code (if it doesn't exist)
  • cd ~/code
  • Clone the repo over HTTPS:
    git clone https://github.com/<user>/<repo>.git
    
    This syntax clones any public repo without authentication (so there's no need to set up SSH keys for GitHub). Avoid git:// URLs: GitHub stopped serving the unauthenticated git:// protocol in 2022, so they now fail.
  • cd into the repo and git checkout whichever branch you want.
  • Activate the Anaconda environment for your framework. E.g., for TensorFlow on Python 3.6:
    source activate tensorflow_p36
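The cloning steps can be sketched as a function that is safe to re-run. The name pk_clone and the CODE_DIR override are hypothetical; user, repo, and branch are whatever you would have typed by hand.

```shell
# Sketch of the clone steps; <user>, <repo>, and the branch are yours to supply.
pk_clone() {
  local user="$1" repo="$2" branch="${3:-}"
  local code_dir="${CODE_DIR:-$HOME/code}"
  if [ -z "$user" ] || [ -z "$repo" ]; then
    echo "usage: pk_clone <user> <repo> [branch]" >&2
    return 1
  fi
  mkdir -p "$code_dir"
  # Clone only if the repo isn't there yet; re-running just switches branches.
  if [ ! -d "$code_dir/$repo/.git" ]; then
    git clone "https://github.com/$user/$repo.git" "$code_dir/$repo" || return 1
  fi
  if [ -n "$branch" ]; then
    git -C "$code_dir/$repo" checkout "$branch" || return 1
  fi
}
```

Usage: pk_clone <user> <repo> <branch>, then activate the environment with source activate tensorflow_p36 as above.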
    

Running Code

  • Add the code directory to your Python path (this lasts only for the current shell session):
export PYTHONPATH=$PYTHONPATH:~/code
  • Install Edward if needed:
pip install edward
  • cd ~/code/tf_gbds
  • Run your commands as you would in a normal terminal, saving all results to ~/data (the mounted EFS drive).
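export only affects the current shell, so the PYTHONPATH line has to be repeated in every new ssh session. One way to avoid that, sketched here with a hypothetical helper append_once, is to add the line to ~/.bashrc exactly once:

```shell
# Hypothetical helper: append a line to a file unless it is already present.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -q quiet, -x match whole lines, -F treat the pattern as a fixed string.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

Usage: append_once 'export PYTHONPATH=$PYTHONPATH:~/code' ~/.bashrc. The single quotes keep $PYTHONPATH and ~ unexpanded in the file, so they are evaluated at each login.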

Launching Tensorboard and Forwarding Port

  • After mounting the data drive and activating the TensorFlow environment, launch TensorBoard on the instance:
tensorboard --logdir='/home/ubuntu/data'

TensorBoard listens on port 6006 by default. To use a different port, pass --port <port> and use that port in place of 6006 when forwarding.

  • On your local machine, forward the port TensorBoard is running on:
ssh -i ~/.ssh/penaltykick.pem -N -f -L localhost:<target port on your machine>:localhost:6006 ubuntu@<public DNS>

TensorBoard should now be available in your browser at http://localhost:<target port on your machine>. Because of -f, the tunnel keeps running in the background after the command returns.

  • If the target port is already in use, the forwarding will fail. Either forward to a different port, or find the processes using that port with sudo lsof -i :<target port> and stop them with kill <PID>. The same two commands close an old tunnel you no longer need.
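The forwarding and port check can be sketched as a pair of local functions (hypothetical names pk_tb_open and pk_tb_close). The sketch assumes lsof is installed on your machine and runs it without sudo, so it only sees your own processes; that is enough to find an old ssh tunnel, but use sudo lsof to see everything on the port.

```shell
# Sketch: open and close the TensorBoard tunnel from your local machine.
PK_KEY="${PK_KEY:-$HOME/.ssh/penaltykick.pem}"

pk_tb_open() {
  local dns="$1" local_port="${2:-6006}" remote_port="${3:-6006}"
  if [ -z "$dns" ]; then
    echo "usage: pk_tb_open <public DNS> [local port] [remote port]" >&2
    return 1
  fi
  # -t prints only PIDs; -sTCP:LISTEN restricts to listening sockets.
  # If lsof is missing, the check fails and the port is treated as free.
  if lsof -t -iTCP:"$local_port" -sTCP:LISTEN >/dev/null 2>&1; then
    echo "local port $local_port is already in use" >&2
    return 1
  fi
  ssh -i "$PK_KEY" -N -f -L "localhost:$local_port:localhost:$remote_port" "ubuntu@$dns" || return 1
  echo "TensorBoard: http://localhost:$local_port"
}

# Kill whatever of yours is listening on the local port (e.g. an old tunnel).
pk_tb_close() {
  local local_port="${1:-6006}" pids
  pids="$(lsof -t -iTCP:"$local_port" -sTCP:LISTEN)"
  if [ -n "$pids" ]; then
    # Unquoted on purpose so several PIDs become separate arguments.
    kill $pids
  fi
}
```

For example, pk_tb_open <public DNS> 16006 forwards local port 16006 to TensorBoard's 6006, and pk_tb_close 16006 stops the tunnel when you're done. pk_tb_close kills whatever is listening on that port, so check with lsof first if you're unsure what it is.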