
New Production with Multiple Instances and vGPU


Overview

As of July 2024, Compute Canada recommends that we migrate to vGPU instances; the old hardware for GPU instances will reach end of service soon. This page includes all the steps to set up (or reproduce) the current Rodan production server(s). Here is a summary of what we currently have for rodan2.simssa.ca. Some of the reasoning behind this choice can be found in issue #1184.

  1. A manager instance on Ubuntu 20.04 with Docker version 24.0.2, build cb74dfc, with 8 vCPUs and 1 vGPU (driver 550, 16GiB GPU RAM) and 40 GiB instance RAM.
  2. A worker instance on Ubuntu 20.04 with Docker version 24.0.2, build cb74dfc, with 16 vCPUs and 16 GiB instance RAM.

Note: do not upgrade Docker to a newer version unless we are sure that the newer Docker Engine does not cause DNS resolution issues on Ubuntu.
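
Since unattended upgrades could otherwise move the package, one option is to hold the Docker packages at their current versions with apt-mark. This is a minimal sketch and assumes Docker was installed from the official docker-ce packages:

# Hold the Docker Engine packages so apt upgrades do not move them
sudo apt-mark hold docker-ce docker-ce-cli containerd.io
# Verify the holds and the installed version
apt-mark showhold
docker --version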

We distribute containers across the two instances as follows, using Docker Swarm.

On manager instance:

  • rodan_rodan-main
  • rodan_celery
  • rodan_gpu-celery
  • rodan_nginx
  • rodan_postgres
  • rodan_py3-celery
  • rodan_redis

On worker instance:

  • rodan_rabbitmq
  • rodan_iipsrv
  • rodan_rodan-client

Ideally, we want to put at least py3-celery on the worker instance. Although this is possible (and tested) with Debian 11 and 12, on Ubuntu 20.04 we have to keep all of those containers on the same instance to avoid a Redis timeout issue. Given the current limit of 8 vCPUs on the manager instance, performance would improve greatly if we could fix this and move those containers to the worker instance.

At this point, our manager instance is booted from the old prod_Rodan2_GPU disk with all the user data and resources, so it is best practice to put postgres on this manager instance as well. The two instances share the data via NFS.

Also, upon testing, the p instance type (the worker instance) can easily be resized while retaining the same IP and Docker network.

Prepare Instances

Important notes on 26/08/2024

We experienced a major server crash: the GPU driver mysteriously disappeared, and the Docker service consumed so much memory that it could neither be launched nor modified. Despite trying everything we could to rescue the server, nothing worked, and the instance continued to report out-of-memory kills for any process we attempted to run. In the end, we realized that the only solution was to deploy a new server.

However, new problems arose: while we could accomplish everything with Debian 11, we couldn't run PACO training using the GPU. On the other hand, when using Ubuntu 20.04, we were unable to deploy the Docker service.

Later, we discovered the root of the problem preventing us from launching a full Docker Swarm. When launching a new Arbutus instance with Ubuntu 20.04, the default Linux kernel is a KVM build (which you can verify by running uname -r). This kernel is compact and optimized for virtual machines, but it does not include IPVS, which is necessary for Swarm's virtual IP services. To use IPVS, a generic Linux kernel is required, or we would have to compile our own kernel. A quick way to check is shown below.
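
For reference, here is a quick way to check whether the running kernel is the KVM build and whether IPVS is available (a small sketch; exact module availability depends on the kernel build):

# A kernel name ending in -kvm lacks IPVS; a -generic kernel includes it
uname -r
# Try to load the ip_vs module and confirm it is present
sudo modprobe ip_vs && lsmod | grep ip_vs
# Alternatively, inspect the kernel build configuration
grep -i ip_vs /boot/config-$(uname -r)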

While it is possible to install a new kernel directly and boot into it (with some complicated steps), doing so would cause another issue: the inability to properly use the NVIDIA GPU driver that comes with the vGPU instance.

To resolve this, the best approach is to start with the old Rodan volume that uses the old generic Linux kernel (or create a volume from a snapshot), boot it in another cloud environment (such as a persistent p-flavor instance), upgrade to the desired Ubuntu version (currently 20.04), then delete the instance and reboot it as a vGPU instance. Now, if you SSH into this new instance and check the kernel, it will be the desired generic version. Installing the vGPU driver at this point will also install the necessary KVM kernel, thereby avoiding compatibility issues between the generic kernel and the vGPU driver, while keeping the default kernel as the generic version that includes IPVS.

Since this process is quite complex, we've saved multiple snapshots at each step for backup purposes.

Original documentation

Go to the Arbutus OpenStack page and click Launch Instance. Here is the information to fill out on the form; an equivalent OpenStack CLI sketch follows the list.

  • Details: Any reasonable name and description. Make sure Availability Zone is Any.
  • Source: For the manager, boot from a volume (so the OS depends on the volume). For the worker, boot from an image, pick the same OS (Ubuntu 20.04 in this case), and create a volume (1500 or 2000 GiB is fine). Make sure Delete Volume on Instance Delete is False for both worker and manager.
  • Flavor: As of July 2024 we use g1-16gb-c8-40gb for manager and p16-16gb for worker.
  • Networks: Select rpp-ichiro-network.
  • Security Groups: Deselect default and select prod-internal.
  • Configuration: Upload cloud.init from the ansible repo.
  • Metadata: Add the "rodan" label so that the new instance is automatically added to the os_service_rodan group managed by ansible. (This can also be done later.) Don't do anything else.
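
For reference, roughly the same instance can also be created from the OpenStack CLI. This is only a sketch: the image name (Ubuntu-20.04), metadata key (rodan=true), and instance name are assumptions, and the exact flags depend on your python-openstackclient version.

# Example: create the worker instance, booting from a new volume created from the Ubuntu image
openstack server create \
  --flavor p16-16gb \
  --image Ubuntu-20.04 \
  --boot-from-volume 2000 \
  --network rpp-ichiro-network \
  --security-group prod-internal \
  --user-data cloud.init \
  --property rodan=true \
  rodan-worker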

After this, run ansible useradd and adminadd so that you can ssh into the new instance.

Set up vGPU drivers (on vGPU instance only)

  1. Remove any existing Nvidia drivers.
sudo apt-get purge "*nvidia*"
  2. Follow the official guide from Compute Canada here according to the OS version.
  3. Install nvidia-container-toolkit. (Official website here.)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
  4. Install the container runtime.
sudo apt install nvidia-container-runtime
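
Before moving on, it is worth confirming that the vGPU driver is working on the host. A minimal check (nvidia-container-cli ships with the container toolkit installed above):

# The vGPU and driver version (e.g. 550) should be listed
nvidia-smi
# The container toolkit should also detect the driver
nvidia-container-cli info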

Install Docker (on both instances)

  1. Make sure to follow the Docker guide for the specific OS and install the exact version we want.
  2. Set up the nvidia runtime for Docker following the guide here. Prerequisites: (1) NVIDIA Container Toolkit; (2) Docker.

Steps:

a. sudo nvidia-ctk runtime configure --runtime=docker --set-as-default

b. sudo systemctl restart docker

c. run docker info and verify docker runtime has nvidia

Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia

Warning: the steps from here on are based on our own practice, as there is no related official guide.

d. In /etc/docker/daemon.json, make sure the nvidia runtime entry has the full path, like

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "args": []
        }
    },
    "default-runtime": "nvidia"
}

e. restart the daemon and docker

sudo systemctl daemon-reload
sudo systemctl restart docker
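
At this point an end-to-end check can confirm that nvidia is the default runtime and that containers can see the vGPU. This is only a sketch; the CUDA image tag is an example, not a requirement:

# Should print "nvidia"
docker info --format '{{.DefaultRuntime}}'
# nvidia-smi inside a throwaway container should list the same vGPU as on the host
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi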

Set up GitHub (on both instances)

  1. Generate the key pair.
ssh-keygen -t rsa -b 4096 -C "[email protected]"

We can name it rodan-docker.

  2. Enter the public key (~/.ssh/rodan-docker.pub) in the GitHub repo settings under Deploy keys, with a name associated with the server. Make sure Allow write access is off.
  3. Create a config file in the ssh folder ~/.ssh/.
Host github.com
  HostName github.com
  User git
  IdentityFile ~/.ssh/rodan-docker
  4. Test ssh.
ssh github.com

It should return

PTY allocation request failed on channel 0
Hi DDMAL/rodan-docker! You've successfully authenticated, but GitHub does not provide shell access.
Connection to github.com closed.
  5. Clone the Rodan repo.
cd /srv/webapps/
git clone --single-branch -b master git@github.com:DDMAL/Rodan.git

Make sure to double check the branch.
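
A quick sanity check that the clone is on the expected branch and uses the SSH remote (plain git commands, nothing specific to this setup):

cd /srv/webapps/Rodan
# Confirm the checked-out branch and the remote URL
git branch --show-current
git remote -v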

Set up Rodan

  1. Modify scripts/production.env to have all the credentials.
  2. Modify rodan-client/config/configuration.json to use port 443 with HTTPS:
    "SERVER_HOST": "rodan2.simssa.ca",
    "SERVER_PORT": "443",
    "SERVER_HTTPS": true,
  3. Adjust production.yml and make sure we have a reasonable resource allocation for each container. Here is the one that works for the current settings.
version: "3.4"

services:

  nginx:
    image: "ddmal/nginx:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.25"
          memory: 0.5G
        limits:
          cpus: "0.25"
          memory: 0.5G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "/usr/sbin/service", "nginx", "status"]
      interval: "30s"
      timeout: "10s"
      retries: 10
      start_period: "5m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      TLS: 1
    ports:
      - "80:80"
      - "443:443"
      - "5671:5671"
      - "9002:9002"
    volumes:
      - "resources:/rodan/data"

  rodan-main:
    image: "ddmal/rodan-main:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.8"
          memory: 3G
        limits:
          cpus: "0.8"
          memory: 3G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "/usr/bin/curl -H 'User-Agent: docker-healthcheck' http://localhost:8000/api/?format=json || exit 1"]
      interval: "30s"
      timeout: "30s"
      retries: 5
      start_period: "2m"
    command: /run/start
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      CELERY_JOB_QUEUE: None
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  rodan-client:
    image: "ddmal/rodan-client:nightly"
    deploy:
      placement:
        constraints:
          - node.role == worker
    volumes:
        - "./rodan-client/config/configuration.json:/client/configuration.json"

  iipsrv:
    image: "ddmal/iipsrv:nightly"
    volumes:
      - "resources:/rodan/data"

  celery:
    image: "ddmal/rodan-main:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "0.8"
          memory: 4G
        limits:
          cpus: "0.8"
          memory: 4G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@celery", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      start_period: "1m"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      CELERY_JOB_QUEUE: celery
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  py3-celery:
    image: "ddmal/rodan-python3-celery:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 6G
        limits:
          cpus: "3"
          memory: 6G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@Python3", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      CELERY_JOB_QUEUE: Python3
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  gpu-celery:
    image: "ddmal/rodan-gpu-celery:v3.0.0"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 18G
        limits:
          cpus: "1"
          memory: 18G
      placement:
        constraints:
          - node.role == manager
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
    healthcheck:
      test: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@GPU", "-t", "30"]
      interval: "30s"
      timeout: "30s"
      retries: 5
    command: /run/start-celery
    environment:
      TZ: America/Toronto
      SERVER_HOST: rodan2.simssa.ca
      CELERY_JOB_QUEUE: GPU
    env_file:
      - ./scripts/production.env
    volumes:
      - "resources:/rodan/data"

  redis:
    image: "redis:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "1"
          memory: 2G
        limits:
          cpus: "1"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto

  postgres:
    image: "ddmal/postgres-plpython:v3.0.0"
    deploy:
      replicas: 1
      endpoint_mode: dnsrr
      resources:
        reservations:
          cpus: "1"
          memory: 2G
        limits:
          cpus: "1"
          memory: 2G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == manager
    healthcheck:
      test: ["CMD-SHELL", "pg_isready", "-U", "postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
    environment:
      TZ: America/Toronto
    volumes:
      - "pg_data:/var/lib/postgresql/data"
      - "pg_backup:/backups"
    env_file:
      - ./scripts/production.env

  rabbitmq:
    image: "rabbitmq:alpine"
    deploy:
      replicas: 1
      resources:
        reservations:
          cpus: "3"
          memory: 4G
        limits:
          cpus: "3"
          memory: 4G
      restart_policy:
        condition: any
        delay: 5s
        window: 30s
      placement:
        constraints:
          - node.role == worker
    healthcheck:
      test: ["CMD", "rabbitmq-diagnostics", "-q", "ping"]
      interval: "30s"
      timeout: "3s"
      retries: 3
    environment:
      TZ: America/Toronto
    env_file:
      - ./scripts/production.env

volumes:
  resources:
  pg_backup:
  pg_data:
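
Before deploying, the stack file can be sanity-checked locally. This is only a sketch and assumes the Docker Compose plugin is available; Swarm will also validate the file on deploy:

# Parse and validate production.yml without deploying anything
docker compose -f production.yml config --quiet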

Set up NFS

NFS allows both instances to access the same directory and stay in sync. The steps below are mostly from this guide. Also check this Ubuntu guide.

  1. Install the NFS packages. On the manager (or the instance with the Rodan data):
sudo apt update
sudo apt install nfs-kernel-server

On the worker:

sudo apt update
sudo apt install nfs-common
  2. Change the ownership of the directory on the manager.
sudo chown nobody:nogroup /var/lib/docker/volumes/
  3. Configure the NFS exports on the manager.
sudo nano /etc/exports

and add this line, using the worker's internal IP address (usually 192.168.xx.xx):

/var/lib/docker/volumes [worker IP](rw,sync,no_subtree_check,no_root_squash)

  4. Restart NFS on the manager.
sudo systemctl restart nfs-kernel-server
  5. Mount NFS on the worker. Make sure you purge everything in /var/lib/docker/volumes/ first so that it is an empty directory.
sudo mount [manager IP]:/var/lib/docker/volumes/ /var/lib/docker/volumes/

Check that it has been mounted correctly with df -h.

  6. (Optional but recommended) Mount the NFS directory at boot on the worker.
sudo nano /etc/fstab

and add this line: [manager IP]:/var/lib/docker/volumes /var/lib/docker/volumes nfs auto,nofail,noatime,nolock,intr,tcp,actimeo=1800 0 0
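
After mounting, it helps to confirm that the export is visible and that writes propagate between the two instances (a small check; the test filename is arbitrary):

# On manager: re-export and list the active exports
sudo exportfs -ra
sudo exportfs -v
# On worker: confirm the mount and do a round-trip write
df -h | grep /var/lib/docker/volumes
sudo touch /var/lib/docker/volumes/nfs-test && ls -l /var/lib/docker/volumes/nfs-test
sudo rm /var/lib/docker/volumes/nfs-test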

Set up Nginx

This is usually done by the ansible playbook /playbooks/nginxconf.yml, after the current manager (or the instance running Nginx) IP has been updated in /playbooks/vars/simssa.ca.yml under the rodan2 block.
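
For reference, a typical invocation from the ansible repo looks roughly like this; the inventory and any extra flags depend on how that repo is configured:

# Run from the root of the ansible repo after updating the rodan2 IP
ansible-playbook playbooks/nginxconf.yml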

Deploy Docker Swarm

(with sudo -i on both instances)

  1. On manager
docker swarm init

and you will see the command containing the worker join token. If a swarm is already running, run docker swarm join-token worker to get it.

  2. On worker, run the command generated in the previous step.

  3. On manager, verify there are two nodes with docker node ls.

  4. Start Rodan.

make pull_prod
make deploy_production
  5. Verify the Rodan services are running correctly with docker service ls on the manager and docker ps -a on both instances. Sometimes rodan_rodan-main will fail when the stack is first launched, but Docker Swarm will successfully recreate it later once the other containers are ready.

  6. Some debugging commands that might be helpful:

docker info
docker service logs [service id]
docker service ps [service id] --no-trunc
docker logs [container id]
docker exec -it [container id] [bash or sh]
  7. Some useful commands to run from /srv/webapps/Rodan on the instance that runs the corresponding container (these targets are defined in the Makefile):
make gpu-celery_log
make py3-celery_log
make celery_log
make rodan-main_log
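
If a single service gets stuck, it can also be redeployed in place without taking down the whole stack. A sketch using the service names from production.yml:

# Force Swarm to recreate the rodan-main service with its current configuration
docker service update --force rodan_rodan-main
# Watch the tasks converge
docker service ps rodan_rodan-main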

Future work

We might consider hosting the data on a separate instance so that we do not have to stick with Ubuntu 20.04 and fit all of the big containers on the manager instance.

Also, to upgrade the OS: if the nova cloud (used for all GPU-related instances) does not provide an upgrade option, it is possible to delete the instance, boot the same volume as a regular p instance, and do the OS upgrade in the persistent cloud. After the volume has been upgraded to the desired newer OS version, we can delete that instance and boot a new vGPU instance from the same volume.

Be sure to search old issues and PRs for more notes.

We have not implemented the auto upgrade, but instructions are here.
