This repository provides a set of solutions for running AutoLFADS in a wider variety of compute environments. This enables more users to take better advantage of the hardware available to them to perform computationally demanding hyperparameter sweeps.
We provide three options for different cluster configurations and encourage the user to select the one that best suits their needs:
- Local Compute: users directly leverage a container image that bundles all of the AutoLFADS software dependencies and provides an entrypoint to the LFADS package. Interaction with this workflow is via YAML model configuration files and command line arguments.
- Unmanaged Compute (Ray): users configure a Ray cluster and interact with the workflow by updating YAML model configurations, updating hyperparameter sweep scripts, and then running experiment code.
- Managed Compute (KubeFlow): users interact with a KubeFlow service by providing an experiment specification that includes model configuration and hyperparameter sweep specifications either as a YAML file or using a code-less UI-based workflow.
The solution matrix below provides a rough guide for identifying a suitable workflow:
| | Local Container | Ray | KubeFlow |
|---|---|---|---|
| Number of Users | 1 | 1-3 | >1 |
| Number of Jobs | 1 | >1 | >1 |
| Preferred Interaction | CLI | CLI | CLI / UI |
| Infrastructure | Local | Unmanaged | Managed/Cloud |
| Cost | $ | $ - $$ | $ - $$$ |
Details describing the AutoLFADS solutions and evaluation against the Neural Latents Benchmark datasets can be found in our paper.
Follow the appropriate guide below to run AutoLFADS on your target platform. We recommend copying the following files to your team's source control and modifying them as necessary to organize and execute custom experiments.
- Model configuration file (e.g. `examples/lorenz/data/config.yaml`)
- KubeFlow configuration file (e.g. `examples/lorenz/kubeflow_job.yaml`) or Ray run script (e.g. `examples/lorenz/ray_run.py`)
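As a purely illustrative sketch, copying the Lorenz example files into your own experiments repository might look like the following; the destination directory name is hypothetical and should match however your team organizes experiments.

```bash
# Illustrative only: copy the Lorenz example files into a hypothetical experiments repo
mkdir -p my-experiments
cp examples/lorenz/data/config.yaml my-experiments/config.yaml
cp examples/lorenz/kubeflow_job.yaml my-experiments/kubeflow_job.yaml   # KubeFlow workflow
cp examples/lorenz/ray_run.py my-experiments/ray_run.py                 # Ray workflow
```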
Running LFADS in a container provides isolation from your host operating system and instead relies on a system-installed container runtime. This workflow is suitable for evaluating algorithm operation on small datasets or exploring specific model parameter changes, and it works well on shared compute environments and other platforms where system package isolation is limited.
Prerequisites: Container runtime (e.g. Docker - Linux / Mac / Windows, Podman - Linux / Mac / Windows, containerd - Linux / Windows) and the NVIDIA Container Toolkit (GPU only).
Instructions are provided in Docker syntax but can be easily adapted for other container runtimes.
- Specify `latest` for CPU operation and `latest-gpu` for GPU-compatible operation
  ```bash
  TAG=latest
  ```
- (OPTIONAL) Pull the docker image to your local machine. This step ensures you have the latest version of the image.
  ```bash
  docker pull ucsdtnel/autolfads:$TAG
  ```
- Browse to a directory that has access to your data and LFADS configuration file
  ```bash
  # The general structure should be as follows
  # (names can be changed, just update the paths in the run parameters)
  # <my-data-directory>
  #   data/
  #     <data files>
  #     config.yaml  (LFADS model parameter file)
  #   output/
  #     <location for generated outputs>
  cd <my-data-directory>
  ```
- Run LFADS (bash scripts provided in `examples` for convenience)
  ```bash
  # Docker flags
  # --rm       removes container resources on exit
  # --runtime  specifies a non-default container runtime
  # --gpus     specifies which GPUs to provide to the container
  # -it        starts the container with interactive input and a TTY
  # -v <host location>:<container location>  mounts a path from host to container
  # $(pwd):    expands the terminal working directory so you don't need to type a fully qualified path
  #
  # AutoLFADS overrides
  # --data         location inside the container with data
  # --checkpoint   location inside the container that maps to a host location to store model outputs
  # --config-file  location inside the container that contains the training configuration
  # KEY VALUE      command line overrides for the training configuration

  # For CPU
  docker run --rm -it -v $(pwd):/share ucsdtnel/autolfads:$TAG \
      --data /share/data \
      --checkpoint /share/container_output \
      --config-file /share/data/config.yaml

  # For GPU (Note: the $TAG value should have a `-gpu` suffix)
  docker run --rm --runtime=nvidia --gpus='"device=0"' -it -v $(pwd):/share ucsdtnel/autolfads:$TAG \
      --data /share/data \
      --checkpoint /share/container_output \
      --config-file /share/data/config.yaml
  ```
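The trailing `KEY VALUE` pairs override individual entries from `config.yaml` at run time. As a minimal sketch, a CPU run that overrides two training parameters might look like the following; the key names below are hypothetical, so substitute the keys that actually appear in your configuration file.

```bash
# Hypothetical override keys shown for illustration only;
# replace TRAIN.BATCH_SIZE / TRAIN.MAX_EPOCHS with keys from your own config.yaml
docker run --rm -it -v $(pwd):/share ucsdtnel/autolfads:$TAG \
    --data /share/data \
    --checkpoint /share/container_output \
    --config-file /share/data/config.yaml \
    TRAIN.BATCH_SIZE 64 TRAIN.MAX_EPOCHS 200
```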
Running AutoLFADS using Ray enables scaling your processing jobs to many worker nodes in an ad-hoc cluster that you specify. This workflow is suitable for running on unmanaged or loosely managed compute resources (e.g. lab compute machines) where you have direct SSH access to the instances. It is also possible to use this workflow with VM-based cloud environments as noted here.
Prerequisites: Conda
- Clone the latest version of `autolfads-tf2`
  ```bash
  git clone git@github.com:snel-repo/autolfads-tf2.git
  ```
- Change the working directory to the newly cloned repository
  ```bash
  cd autolfads-tf2
  ```
- Create a new conda environment
  ```bash
  conda create --name autolfads-tf2 python=3.7
  ```
- Activate the environment
  ```bash
  conda activate autolfads-tf2
  ```
- Install GPU specific packages
  ```bash
  conda install -c conda-forge cudatoolkit=10.0
  conda install -c conda-forge cudnn=7.6
  ```
- Install LFADS
  ```bash
  python3 -m pip install -e lfads-tf2
  ```
- Install LFADS Ray Tune component
  ```bash
  python3 -m pip install -e tune-tf2
  ```
- Modify `ray/ray_cluster_template.yaml` with the appropriate information. Note, you will need to fill in values for all `<...>` stubs (see the sketch after these steps).
- Modify `ray/run_pbt.py` with the desired hyperparameter exploration configuration
- Modify the `ray/run_pbt.py` variable `SINGLE_MACHINE` to be `False`
- Run AutoLFADS
  ```bash
  python3 ray/run_pbt.py
  ```
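For reference, below is a minimal sketch of the kind of values the cluster template expects. It assumes the template follows Ray's standard on-premise (`type: local`) cluster provider format; the actual field names and structure in `ray/ray_cluster_template.yaml` may differ, so treat this purely as an illustration of what the `<...>` stubs typically correspond to.

```yaml
# Illustrative sketch only -- fill in values to match your machines and the actual template
cluster_name: autolfads-cluster          # any name for the ad-hoc cluster
provider:
  type: local                            # on-premise machines reachable over SSH
  head_ip: <head-node-ip>                # machine that coordinates the sweep
  worker_ips:                            # machines that run additional workers
    - <worker-node-ip-1>
    - <worker-node-ip-2>
auth:
  ssh_user: <ssh-username>               # account with SSH access to every node
  ssh_private_key: <path-to-private-key> # e.g. ~/.ssh/id_rsa
```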
Running AutoLFADS using KubeFlow enables scaling your experiments across an entire cluster. This workflow allows for isolated multi-user utilization and is ideal for running on managed infrastructure (e.g. University, public or private cloud) or on service-oriented clusters (i.e. no direct access to compute instances). It leverages industry standard tooling and enables scalable compute workflows beyond AutoLFADS for groups looking to adopt a framework for scalable machine learning.
If you are using a cloud provider, KubeFlow provides a series of tutorials to get you set up with a completely configured install. We currently require a feature that was introduced in Katib 0.14. The installation below provides a pathway for installing KubeFlow on a vanilla Kubernetes cluster while integrating the noted changes.
Prerequisites: Kubernetes cluster access and Ansible (installed locally; only needed when deploying KubeFlow)
- Install Istio if your cluster does not yet have it
  ```bash
  ansible-playbook istio.yml --extra-vars "run_option=install"
  ```
- Install NFS Storage Controller (if you need an RWX storage driver)
  ```bash
  ansible-playbook nfs_storage_class.yml --extra-vars "run_option=install"
  ```
- Install KubeFlow
  ```bash
  ansible-playbook kubeflow.yml --extra-vars "run_option=install"
  ```
- Use `examples/lorenz/kubeflow_job.yaml` as a template to specify a new job with the desired hyperparameter exploration configuration and AutoLFADS configuration. Refer to the dataset README for details on how to acquire and prepare the data.
- Run AutoLFADS
  ```bash
  kubectl create -f kubeflow_job.yaml
  ```
- (Optional) Start or monitor the job using the KubeFlow UI
  ```bash
  # Start a tunnel between your computer and the Kubernetes network if you did not add an ingress entry
  kubectl port-forward svc/istio-ingressgateway -n istio-system --address 0.0.0.0 8080:80
  # Browse to http://localhost:8080
  ```
- Results can be downloaded from the KubeFlow Volumes UI or directly from the data mount location.
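If you prefer the command line over the UI, the job can also be monitored with `kubectl`. This is a minimal sketch that assumes the job was created in your user namespace and is represented as a Katib Experiment; adjust the namespace and resource names to match your deployment.

```bash
# List experiments and their optimization status (replace <namespace> with your profile namespace)
kubectl get experiments -n <namespace>

# List the individual trials spawned by the experiment
kubectl get trials -n <namespace>

# Inspect a specific experiment in detail
kubectl describe experiment <experiment-name> -n <namespace>
```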
Find a bug? Built a new integration for AutoLFADS on your framework of choice? We'd love to hear about it and work with you to integrate your solution into this repository! Drop us an Issue or PR and we'd be happy to collaborate.
If you found this work helpful, please cite the following works:
@article{keshtkaran2021large,
title = {A large-scale neural network training framework for generalized estimation of single-trial population dynamics},
author = {Keshtkaran, Mohammad Reza and Sedler, Andrew R and Chowdhury, Raeed H and Tandon, Raghav and Basrai, Diya and Nguyen, Sarah L and Sohn, Hansem and Jazayeri, Mehrdad and Miller, Lee E and Pandarinath, Chethan},
journal = {BioRxiv},
year = {2021},
publisher = {Cold Spring Harbor Laboratory}
}
@article{Patel2023,
doi = {10.21105/joss.05023},
url = {https://doi.org/10.21105/joss.05023},
year = {2023},
publisher = {The Open Journal},
volume = {8},
number = {83},
pages = {5023},
author = {Aashish N. Patel and Andrew R. Sedler and Jingya Huang and Chethan Pandarinath and Vikash Gilja},
title = {High-performance neural population dynamics modeling enabled by scalable computational infrastructure},
journal = {Journal of Open Source Software}
}