A front end to deploy any of the IBM PowerAI supported deep learning frameworks across multiple target machines.
IBM's PowerAI documentation can be found at ibm.biz/powerai; the PowerAI download page is linked from there.
- caffe-bvlc: Berkeley Vision and Learning Center (BVLC) Caffe framework
- caffe-nv: NVIDIA fork of the BVLC Caffe framework
- caffe-ibm: IBM version of the Caffe framework
- digits: DIGITS (the Deep Learning GPU Training System), a web app for training deep learning models
- tensorflow: TensorFlow™, an open source software library for numerical computation using data flow graphs, provided by Google
- torch: the main package in Torch7, which defines data structures for multi-dimensional tensors and the mathematical operations over them
- theano: a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently
Before proceeding to the installation steps, a few prerequisite tasks must be completed.
Create a network bridge named "br0" with a port connected to the management network (192.168.3.0/24).
Below is an example interface definition in the local `/etc/network/interfaces` file. Note that `enP1p3s0f0` is the name of the interface connected to the management network. Check the recipe documentation for more information on this step.
```
auto br0
iface br0 inet static
    address 192.168.3.3
    netmask 255.255.255.0
    bridge_ports enP1p3s0f0
```
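After editing `/etc/network/interfaces`, the bridge can be brought up without a reboot. A minimal sketch, assuming Ubuntu 16.04 with classic ifupdown networking (the `bridge_ports` option needs the bridge-utils package):

```
# Assumes Ubuntu 16.04 with ifupdown networking
sudo apt-get install -y bridge-utils   # needed for the bridge_ports option
sudo ifup br0                          # bring up the newly defined bridge
ip addr show br0                       # verify the 192.168.3.3 address is assigned
```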
PowerAI requires NVIDIA's cuDNN library. Visit https://developer.nvidia.com/cudnn. The current versions of the included frameworks are built using cuDNN 5.
- Log in or register for NVIDIA's Accelerated Computing Developer Program.
- Download the following .deb files:
- cuDNN v5.1 Runtime Library for Ubuntu 16.04 Power8 (Deb)
- cuDNN v5.1 Developer Library for Ubuntu 16.04 Power8 (Deb)
- Copy the .deb files to the management server and export the CUDNN5 and CUDNN environment variables pointing to the locations of the .deb files, as shown below.
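A sketch of that export step; the filenames and which variable maps to the runtime versus the developer package are assumptions here, so confirm against the recipe document:

```
# Illustrative paths/filenames; substitute the .deb files you actually downloaded.
# The runtime/developer mapping of CUDNN5 and CUDNN is an assumption.
export CUDNN5=/root/cudnn-v5.1-runtime-ubuntu16.04-power8.deb
export CUDNN=/root/cudnn-v5.1-developer-ubuntu16.04-power8.deb
```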
This solution has an option to use an available InfiniBand network. In order to automate the network configuration, the Mellanox OFED package is required. Visit the Mellanox download site.
- Download the latest .tgz file for Ubuntu 16.04 ppc64le
- Copy the .tgz file to the management server and export the MLX_OFED environment variable pointing to the location of the .tgz file, as shown below.
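A sketch of the OFED export, with an illustrative filename (use the archive you actually downloaded):

```
# Illustrative path/filename; substitute the Mellanox OFED .tgz you downloaded
export MLX_OFED=/root/MLNX_OFED_LINUX-ubuntu16.04-ppc64le.tgz
```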
- git clone https://github.com/open-power-ref-design/deep-learning
- Run `install.sh`
- Build/edit the config.yml file using one of the templates provided.
- Run `deploy.sh <desired config file>` to initiate deployment (a condensed walk-through follows this list).
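Putting the steps together, a condensed sketch; the working directory and the choice of template are assumptions, so adapt them to your cluster:

```
git clone https://github.com/open-power-ref-design/deep-learning
cd deep-learning                       # assumed location of install.sh and deploy.sh
./install.sh
cp config.yml.tfdist.3min config.yml   # start from one of the provided templates
vi config.yml                          # edit node addresses and roles for your cluster
./deploy.sh config.yml
```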
The first of many rack solutions designed to make it easy to deploy a distributed workload across many heterogeneous nodes in a cluster. This recipe can be used to install a distributed training cluster for TensorFlow.
The PowerAI recipe can deploy distributed TensorFlow across any combination of machines. For sample purposes, a configuration with 3 nodes (1 parameter server and 2 workers) is provided in `config.yml.tfdist.3min`. More information is available in the recipe document in the /documents folder.
Every configuration has parameter server and worker nodes; this is where you specify which machines are designated for each server type.
The configuration sample above ensures that TensorFlow, CUDA, and cuDNN are installed on each machine in the cluster.
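For example, after editing a copy of the sample for your machines, it can be passed to `deploy.sh` as the desired config file (a sketch of the step described above):

```
# Edit the 3-node sample (1 parameter server, 2 workers) for your machines, then deploy it
vi config.yml.tfdist.3min
./deploy.sh config.yml.tfdist.3min
```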
The distributed TensorFlow config for the PowerAI recipe currently includes two sample training sets: CIFAR-10 and MNIST.
All samples can be found in /opt/DL/tensorflow/samples. The script `/opt/DL/tensorflow/samples/dist_deb.sh` is custom-created to reflect the rack setup on every machine. To try out a specific sample, execute `dist_deb.sh` with the desired model, e.g. `dist_deb.sh mnist/mnist_train.py` or `dist_deb.sh cifar10/cifar10_train.py`.