How to run experiments on Exx

William Guimont-Martin edited this page Sep 14, 2023 · 21 revisions

Using Slurm and Docker to run jobs

This guide is a quickstart for running jobs on the Exx server.

SSH access

The IP address of the server is written on the machine itself.

ssh <username>@<ip>

Transfer data to the server

Run sftp to open its interactive prompt:

sftp <username>@<ip>

Once inside sftp, run the following commands to transfer files:

# change directory on the server
sftp> cd path/to/my/cool/files
# change directory on the local computer
sftp> lcd path/to/my/cool/local_files
# similarly, you can use ls to list remote files, and lls to list local files
# to get help, run
sftp> help

# to download files from the server
sftp> get remote_file.txt
# to upload files to the server
sftp> put local_file.txt
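
For transfers you repeat often, the same commands can be saved in a batch file and replayed with sftp's -b option. This is a minimal sketch: the file name transfer.sftp is an arbitrary choice, the paths are the placeholders used above, and batch mode generally needs non-interactive authentication such as an SSH key.

```shell
# Write the sftp commands to a batch file (the file name is arbitrary).
cat > transfer.sftp <<'EOF'
cd path/to/my/cool/files
lcd path/to/my/cool/local_files
get remote_file.txt
put local_file.txt
EOF

# Replay the batch file non-interactively once the placeholders are filled in:
#   sftp -b transfer.sftp <username>@<ip>
```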

Docker

All jobs on the server must use Docker containers to avoid polluting the server with project-related dependencies.

Installation

The following steps should be executed on your computer, not on the server. The server already has everything it needs.

Docker

Install Docker Engine and follow the post-installation steps for Linux to allow non-sudo users to use docker.

Start docker on boot:

sudo systemctl enable docker.service
sudo systemctl enable containerd.service

Consider adding the following line to your .bashrc for faster docker image builds:

export DOCKER_BUILDKIT=1
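
One possible way to add that line without duplicating it on repeated runs is sketched below. It assumes your shell reads $HOME/.bashrc; adjust the path if you use another startup file.

```shell
# Append the export to ~/.bashrc only if the exact line is not already present.
line='export DOCKER_BUILDKIT=1'
bashrc="$HOME/.bashrc"
grep -qxF "$line" "$bashrc" 2>/dev/null || printf '\n%s\n' "$line" >> "$bashrc"
```

Open a new terminal or run source ~/.bashrc for the change to take effect.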

nvidia-container-toolkit

Then, to access NVIDIA GPUs from inside Docker containers, install nvidia-container-toolkit. On Ubuntu-based systems, follow this installation guide from NVIDIA.

On CentOS-based systems, follow these instructions:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
 
sudo yum clean expire-cache
sudo yum install nvidia-container-toolkit -y

nvidia-container-runtime

You'll also need nvidia-container-runtime. To install it on Ubuntu-based systems, follow this installation guide.

nvidia default runtime

Create or edit /etc/docker/daemon.json so that it has the following content:

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

For more information, see this issue.

A reboot will be necessary.

Verify installation

Verify that you can see GPUs from docker containers:

docker run --rm --gpus all -e NVIDIA_VISIBLE_DEVICES=all pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel nvidia-smi

Running your projects in Docker

Read this guide from docs.docker.com to get started with Docker.

.dockerignore

First, create a file called .dockerignore that lists everything you don't want copied into the container. Usually, this file ignores the same things you'd ignore in a .gitignore. In particular, ignore heavy files such as datasets and Python virtual environments.
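
As a starting point, the snippet below writes an example .dockerignore into the current folder. The entries (venv, data, checkpoints, and so on) are assumptions about a typical Python project, not names taken from an actual repository, so adapt them to your own layout.

```shell
# Write an illustrative .dockerignore in the current project folder.
cat > .dockerignore <<'EOF'
# version control
.git
# Python virtual environments and caches
venv
.venv
**/__pycache__
**/*.pyc
# heavy files: datasets, checkpoints
data
checkpoints
EOF
```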

Dockerfile

Then, create your Dockerfile. Here's a small guide that should get you started with Docker: Sample application. You can also check whether the libraries you use provide their own Dockerfiles, since they can be a good starting point for yours (e.g., detectron2's Dockerfile).

Managing volumes for training

See norlab-ulaval/mask_bev/blob/main/docs/TRAINING.md.

Managing volumes with docker-compose.yml (NOT RECOMMENDED)

Then, create a file named docker-compose.yml, and adapt the following code:

services:
  training:
    build: .  # should point to the folder containing the Dockerfile
    stdin_open: true
    tty: true
    ipc: host
    # Add entries to volume to access data on the host computer
    volumes:
      - local/path:remote/path
    entrypoint: run.sh  # script that will be run (start the training)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [$CUDA_VISIBLE_DEVICES]
              capabilities: [gpu]

This file makes it easy to manage volumes, that is, folders shared between your computer and the Docker container. Be sure to have the correct permissions on your volumes! Check them on your system with ls -l.
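
The permission check can be scripted. The helper below is a hypothetical sketch (check_volume_dir is not a Docker or Slurm command) that prints the owner and mode of a host folder and reports whether the current user can read and write it. Keep in mind that the user inside the container may differ from your host user, so this only checks the host side.

```shell
# Hypothetical helper: report whether the current user can read and write
# a host folder that will be mounted as a volume.
check_volume_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "$dir: not a directory"
        return 1
    fi
    ls -ld "$dir"                            # owner, group, and mode
    echo "you are uid=$(id -u) gid=$(id -g)"
    if [ -r "$dir" ] && [ -w "$dir" ]; then
        echo "$dir: readable and writable"
    else
        echo "$dir: missing read or write permission"
        return 1
    fi
}

# Demo on a throwaway folder; pass your real volume path instead.
demo_dir="$(mktemp -d)"
check_volume_dir "$demo_dir"
```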

Then, you can start it with:

docker-compose up --build

Slurm jobs

To manage jobs on the Exx server, we use the Slurm Workload Manager. It schedules jobs and manages resources such as GPUs.

Consult norlab-ulaval/Norlab_wiki/wiki/Compute-Canada-Quickstart to get a quick start on how to use Slurm to schedule jobs.

Job example: norlab-ulaval/mask_bev/blob/main/slurm_train.sh

In short, a job script has the following format:

#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=4-00:00
#SBATCH --job-name=47_swinpillars_4th
#SBATCH --output=%x-%j.out

cd project_folder || exit
docker-compose up --build

The header tells Slurm what resources you need to run your job.

  • --gres=gpu:N sets the number of GPUs you want to use
  • --cpus-per-task=N sets the number of CPUs you want to use
  • --time=D-HH:MM sets the maximum duration of the job (D=days, HH=hours, MM=minutes)
  • --job-name=FOO sets the name of the job
  • --output=%x-%j.out sets the name of the job's output log (%x is the job name, %j is the job ID)
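
To make the D-HH:MM format concrete, the sketch below converts a time limit into minutes. slurm_limit_to_minutes is a hypothetical helper, not part of Slurm, and it only handles the D-HH:MM form described above; Slurm accepts other forms as well, such as the HH:MM:SS used with salloc later in this guide.

```shell
# Hypothetical helper (not part of Slurm): convert a D-HH:MM limit to minutes.
slurm_limit_to_minutes() {
    # Split on "-" and ":" so that $1=days, $2=hours, $3=minutes inside awk.
    printf '%s\n' "$1" | awk -F'[-:]' '{ print $1 * 1440 + $2 * 60 + $3 }'
}

slurm_limit_to_minutes 4-00:00   # 4 days, as in the job script: prints 5760
slurm_limit_to_minutes 0-05:30   # 5 hours 30 minutes: prints 330
```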

To schedule a job, run:

sbatch script.sh
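
If you want the job ID right away, for example to find the log file, sbatch's --parsable option prints only the job ID (followed by ;cluster when a cluster name is present). The helper below is a hypothetical sketch: submit_and_show_log is not a Slurm command, and it assumes the job script uses --output=%x-%j.out as in the example above, with the job name passed by hand as the second argument.

```shell
# Hypothetical helper: submit a job script and print the log file to follow.
# Assumes the script contains "#SBATCH --output=%x-%j.out".
submit_and_show_log() {
    script="$1"
    job_name="$2"
    # --parsable makes sbatch print only "jobid" or "jobid;cluster"
    submitted="$(sbatch --parsable "$script")" || return 1
    job_id="${submitted%%;*}"    # drop the optional ";cluster" suffix
    echo "Submitted job $job_id, log file: ${job_name}-${job_id}.out"
}

# Only submit when sbatch is available, i.e. on the server.
if command -v sbatch >/dev/null 2>&1; then
    submit_and_show_log script.sh 47_swinpillars_4th
fi
```

Once the job starts, the log can be followed with tail -f on the printed file name.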

You can then see the queued jobs using:

squeue

To see more information about queued and running jobs, set a custom output format:

export SQUEUE_FORMAT="%.18i %.9P %.25j %.8u %.2t %.10M %.6D %.20e %b %.8c"
squeue

For more information on the format, see man squeue.

To cancel the job with id <id>, use:

scancel <id>

To run an interactive job, useful to test stuff and do one-off things, use:

salloc --gres=gpu:1 --cpus-per-task=4 --time=5:00:00

NUMA

If you encounter this error:

set_mempolicy: Operation not permitted

Run Docker with --privileged.

Singularity

https://researchcomputing.princeton.edu/support/knowledge-base/singularity
