How to run experiments on Exx
This guide is a quickstart for running jobs on the Exx server.
The server's IP address is written on the server itself. Connect to it with SSH:
ssh <username>@<ip>
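Optionally, you can add an entry to your ~/.ssh/config so you don't have to retype the username and IP every time. A minimal sketch (the alias exx and the placeholders are examples, not official values):
# ~/.ssh/config
Host exx
    HostName <ip>
    User <username>
You can then connect with ssh exx, and the same alias works for sftp below.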
Run sftp to enter the tool prompt:
sftp <username>@<ip>
Once inside sftp, run the following commands to transfer files:
# change directory on the server
sftp> cd path/to/my/cool/files
# change directory on the local computer
sftp> lcd path/to/my/cool/local_files
# similarly, you can use ls to list remote files, and lls to list local files
# to get help, run
sftp> help
# to download files from the server
sftp> get remote_file.txt
# to upload files to the server
sftp> put local_file.txt
All jobs on the server must use Docker containers to avoid polluting the server with project-related dependencies.
The following steps should be executed on your computer, not on the server. The server already has everything it needs.
Install Docker Engine and follow the post-installation steps for Linux to allow non-sudo users to use docker.
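On most Linux distributions, the post-installation steps boil down to adding your user to the docker group; as a reminder (refer to the official guide for the authoritative steps):
# create the docker group if it does not exist yet
sudo groupadd docker
# add your user to the docker group
sudo usermod -aG docker $USER
# log out and back in (or run 'newgrp docker') for the change to take effect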
Start docker on boot:
sudo systemctl enable docker.service
sudo systemctl enable containerd.service
Consider adding the following line to your .bashrc for faster docker image builds:
export DOCKER_BUILDKIT=1
Then, to access NVIDIA GPUs from inside Docker containers, install nvidia-container-toolkit. For Ubuntu-based systems, follow this installation guide from NVIDIA.
For CentOS-based systems, follow these instructions:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum clean expire-cache
sudo yum install nvidia-container-toolkit -y
You'll also need nvidia-container-runtime. To install it on an Ubuntu-based OS, follow this installation guide.
Edit or create /etc/docker/daemon.json with the following content:
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia"
}
For more information, see this issue.
A reboot will be necessary.
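After rebooting, you can quickly check that the NVIDIA runtime is registered and set as the default (the exact output depends on your Docker version):
# should list an nvidia runtime and show it as the default
docker info | grep -i runtime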
Verify that you can see GPUs from docker containers:
docker run --rm --gpus all -e NVIDIA_VISIBLE_DEVICES=all pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel nvidia-smi
Read this guide from docs.docker.com to get started with Docker.
First, create a file called .dockerignore and list everything you don't want to copy into the container. Usually, this file ignores the same things you'd ignore in a .gitignore. You'll want to exclude heavy files like datasets and Python virtual environments.
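As an example, a .dockerignore for a typical Python project could look like the following; adapt the entries to your own project layout:
# virtual environments and caches
venv/
.venv/
__pycache__/
# datasets and experiment outputs (mount them as volumes instead)
data/
output/
# version control and editor files
.git/
.idea/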
Then, create your Dockerfile. Here's a small guide that should get you started with Docker: Sample application. You can also check whether the libraries you use provide Dockerfiles; they can be a good starting point for your own, e.g. detectron2's Dockerfile.
See norlab-ulaval/mask_bev/blob/main/docs/TRAINING.md.
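As an illustration only (not the official Dockerfile of any project), here is a minimal sketch based on the PyTorch image used elsewhere in this guide, assuming your repository has a requirements.txt and a run.sh at its root:
# start from an image that already ships CUDA, cuDNN and PyTorch
FROM pytorch/pytorch:1.10.0-cuda11.3-cudnn8-devel

WORKDIR /app

# install project dependencies first to benefit from Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# copy the rest of the project (filtered by .dockerignore)
COPY . .

# make the training entrypoint executable
RUN chmod +x run.sh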
Then, create a file named docker-compose.yml and adapt the following code:
services:
  training:
    build: .  # should point to the folder containing the Dockerfile
    stdin_open: true
    tty: true
    ipc: host
    # Add entries to volumes to access data on the host computer
    volumes:
      - host/path:container/path
    entrypoint: run.sh  # script that will be run (start the training)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [$CUDA_VISIBLE_DEVICES]
              capabilities: [gpu]
This file makes it easy to manage volumes, i.e. shared folders between your computer and the Docker container. Be sure to have the correct permissions on your volumes! Check them on your system with ls -l.
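For example, if a mounted folder belongs to root, the container may fail to write to it; something along these lines (paths are placeholders) usually fixes it:
# check ownership and permissions of the shared folder
ls -l /path/to/shared_folder
# give your user ownership of the folder, adjusting the path to your volume
sudo chown -R $USER:$USER /path/to/shared_folder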
Then, you can start it with:
docker-compose up --build
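Since the docker-compose.yml above reads $CUDA_VISIBLE_DEVICES, you can choose which GPU(s) the container sees by setting that variable when launching; the index below is only an example (when the job is submitted through Slurm with --gres=gpu, Slurm normally sets this variable for you):
# run the container on GPU 0 only
CUDA_VISIBLE_DEVICES=0 docker-compose up --build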
To manage jobs on the Exx server, we use the Slurm Workload Manager. It allows scheduling jobs and managing resources like GPUs.
Consult norlab-ulaval/Norlab_wiki/wiki/Compute-Canada-Quickstart to get a quick start on how to use Slurm to schedule jobs.
Job example: norlab-ulaval/mask_bev/blob/main/slurm_train.sh
In short, a job script must follow this format:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --time=4-00:00
#SBATCH --job-name=47_swinpillars_4th
#SBATCH --output=%x-%j.out
cd project_folder || exit
docker-compose up --build
The header tells Slurm what resources you need to run your job.
- --gres=gpu:N specifies the number of GPUs you want to use
- --cpus-per-task=N specifies the number of CPUs you want to use
- --time=D-HH:MM specifies the maximum duration of the job (D = days, HH = hours, MM = minutes)
- --job-name=FOO specifies the name of the job
- --output=%x-%j.out formats the output log of the job
To schedule a job, run:
sbatch script.sh
You can then see the queued jobs using:
squeue
To see more information on running jobs:
export SQUEUE_FORMAT="%.18i %.9P %.25j %.8u %.2t %.10M %.6D %.20e %b %.8c"
squeue
More information on the format is available via man squeue.
To cancel the job with id <id>, use:
scancel <id>
To run an interactive job, which is useful for testing things and one-off tasks, use:
salloc --gres=gpu:1 --cpus-per-task=4 --time=5:00:00
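Once the allocation is granted, you can run your commands by hand, for example (project_folder is a placeholder; on a single-machine server like Exx, salloc typically drops you into a shell that already holds the allocation):
cd project_folder
docker-compose up --build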
If you encounter this error:
set_mempolicy: Operation not permitted
Run Docker with --privileged.
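If you launch your jobs through docker-compose as described above, the equivalent is to add the privileged flag to your service (a sketch, to merge with your existing docker-compose.yml):
services:
  training:
    # ... keep the rest of the configuration as above ...
    privileged: true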
See also: https://researchcomputing.princeton.edu/support/knowledge-base/singularity