From 88b4b752915e0ceebc95a70f1978bb3b26819796 Mon Sep 17 00:00:00 2001
From: Carlos Alarcon
Contents
2.8 Interactive Jobs
2.8.1 Command Line
2.8.2 Graphical Applications
-
2.8.3 Jupyter Notebooks
-
2.8.4 VScode
-
2.9 Scheduler Environment Variables
-
2.10 SSH Keys For MPI
-
2.11 Creating Virtual Environments
-
2.11.1 Anaconda
-
2.11.2 Python
-
2.12 Example Job Script: Fluent
-
2.13 Example Job: efficientdet
-
2.14 Java Jobs
-
2.15 Scheduling On The GPU Nodes
-
2.15.1 P6 on Multi-GPU, Multi-Node
-
2.15.2 CUDA
-
2.15.3 Special Notes for sending CUDA jobs to the GPU Queue
-
2.15.4 OpenISS Examples
-
2.16 Singularity Containers
-
3 Conclusion
-
3.1 Important Limitations
-
3.2 Tips/Tricks
-
3.3 Use Cases
-
A History
-
A.1 Acknowledgments
-
A.2 Migration from UGE to SLURM
-
A.3 Phases
-
A.3.1 Phase 4
-
A.3.2 Phase 3
-
A.3.3 Phase 2
-
A.3.4 Phase 1
-
B Frequently Asked Questions
-
B.1 Where do I learn about Linux?
-
B.2 How to use the “bash shell” on Speed?
-
B.2.1 How do I set bash as my login shell?
-
B.2.2 How do I move into a bash shell on Speed?
-
B.2.3 How do I use the bash shell in an interactive session on Speed?
-
B.2.4 How do I run scripts written in bash on Speed?
-
B.3 How to resolve “Disk quota exceeded” errors?
-
B.3.1 Probable Cause
-
B.3.2 Possible Solutions
-
B.3.3 Example of setting working directories for COMSOL
-
B.3.4 Example of setting working directories for Python Modules
-
B.4 How do I check my job’s status?
-
B.5 Why is my job pending when nodes are empty?
-
B.5.1 Disabled nodes
-
-
-
-
B.5.2 Error in job submit request.
-
C Sister Facilities
+
2.8.3 Jupyter Notebooks in Singularity
+
2.8.4 Jupyter Labs in Conda and Pytorch
+
2.8.5 VScode
+
2.9 Scheduler Environment Variables
+
2.10 SSH Keys For MPI
+
2.11 Creating Virtual Environments
+
2.11.1 Anaconda
+
2.11.2 Python
+
2.12 Example Job Script: Fluent
+
2.13 Example Job: efficientdet
+
2.14 Java Jobs
+
2.15 Scheduling On The GPU Nodes
+
2.15.1 P6 on Multi-GPU, Multi-Node
+
2.15.2 CUDA
+
2.15.3 Special Notes for sending CUDA jobs to the GPU Queue
+
2.15.4 OpenISS Examples
+
2.16 Singularity Containers
+
3 Conclusion
+
3.1 Important Limitations
+
3.2 Tips/Tricks
+
3.3 Use Cases
+
A History
+
A.1 Acknowledgments
+
A.2 Migration from UGE to SLURM
+
A.3 Phases
+
A.3.1 Phase 4
+
A.3.2 Phase 3
+
A.3.3 Phase 2
+
A.3.4 Phase 1
+
B Frequently Asked Questions
+
B.1 Where do I learn about Linux?
+
B.2 How to use the “bash shell” on Speed?
+
B.2.1 How do I set bash as my login shell?
+
B.2.2 How do I move into a bash shell on Speed?
+
B.2.3 How do I use the bash shell in an interactive session on Speed?
+
B.2.4 How do I run scripts written in bash on Speed?
+
B.3 How to resolve “Disk quota exceeded” errors?
+
B.3.1 Probable Cause
+
B.3.2 Possible Solutions
+
B.3.3 Example of setting working directories for COMSOL
+
B.3.4 Example of setting working directories for Python Modules
+
B.4 How do I check my job’s status?
+
B.5 Why is my job pending when nodes are empty?
+
+
+
+
B.5.1 Disabled nodes
+
B.5.2 Error in job submit request.
+
C Sister Facilities
Annotated Bibliography
@@ -313,7 +314,7 @@ 1.6
After reviewing the “What Speed is” (Section 1.4) and “What Speed is Not” (Section 1.5), request
-access to the “Speed” cluster by emailing: rt-ex-hpc AT encs.concordia.ca. CGS ENCS
+access to the “Speed” cluster by emailing: rt-ex-hpc AT encs.concordia.ca. GCS ENCS
faculty and staff may request access directly. Students must include the following in their message:
@@ -327,7 +328,7 @@
Non-GCS faculty / students need to get a “sponsor” within GCS, such that your guest GCS ENCS account is created first. A sponsor can be any GCS Faculty member you collaborate with. Failing that, request the approval from our Dean’s Office; via our Associate Deans Drs. Eddie Hoi Ng or
-Emad Shihab. External entities to Concordia who collaborate with CGS Concordia researchers, should
+Emad Shihab. External entities to Concordia who collaborate with GCS Concordia researchers, should
also go through the Dean’s office for approvals. Non-GCS students taking a GCS course do have their GCS ENCS account created automatically, but still need the course instructor’s approval to use the service.
@@ -780,12 +781,48 @@
See man sacct or sacct -e for details of the available formatting options. You can define your preferred default format in the SACCT_FORMAT environment variable in your .cshrc or .bashrc files. +
+seff [job-ID]: reports on the efficiency of a job’s CPU and memory utilization. Run it only on completed/finished jobs, not on RUNNING jobs, as the efficiency statistics may otherwise be misleading.
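For example, to query a job that has already finished (the job ID below is just a placeholder):

    # report CPU and memory efficiency of a completed job
    seff 123456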
If you define the following directives in your batch script, you will receive seff output in your email when your job is finished:
++ #SBATCH --mail-type=ALL + #SBATCH --mail-user=USER_NAME@encs.concordia.ca + ## Replace USER_NAME with your encs username. ++
+
Output example: + + + +
++ Job ID: XXXXX + Cluster: speed + User/Group: user1/user1 + State: COMPLETED (exit code 0) + Nodes: 1 + Cores per node: 4 + CPU Utilized: 00:04:29 + CPU Efficiency: 0.35% of 21:32:20 core-walltime + Job Wall-clock time: 05:23:05 + Memory Utilized: 2.90 GB + Memory Efficiency: 2.90% of 100.00 GB ++
+
In addition to the basic sbatch options presented earlier, there are a few additional options that are generally useful:
The many sbatch options available are read with man sbatch. Also note that sbatch options can be specified during the job-submission command, and these override existing script options (if present). The syntax is sbatch [options] PATHTOSCRIPT, but unlike in the script, the options are specified without the leading #SBATCH (e.g., sbatch -J sub-test --chdir=./ --mem=1G ./tcsh.sh).
+
Array jobs are those that start a batch job or a parallel job multiple times. Each iteration of the job array is called a task and receives a unique job ID. Array jobs are only supported for batch jobs; submit time is \(< 1\) second, compared to repeatedly submitting the same regular job over and over, even from a script.
To submit an array job, use the --array option of the sbatch command as follows:
sbatch --array=n-m[:s] <batch_script>
-t Option Syntax:
Examples:
Output files for Array Jobs:
The default output and error-files are slurm-job_id_task_id.out. This means that Speed creates an output and an error-file for each task generated by the array-job, as well as one for the super-ordinate array-job. To alter this behavior, use the -o and -e options of sbatch.
For more details about Array Job options, please review the manual pages for sbatch by executing the following at the command line on speed-submit: man sbatch.
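As an illustrative sketch only (the script body and input-file names are hypothetical, not taken from the manual), an array job of ten tasks that each process their own input file could look like this:

    #!/encs/bin/tcsh
    #SBATCH --job-name=array-demo
    #SBATCH --array=1-10
    #SBATCH --mem=1G
    # Each task selects its own input file via the task index set by the scheduler.
    ./process_input input-$SLURM_ARRAY_TASK_ID.dat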
+
For jobs that can take advantage of multiple machine cores, up to 32 cores (per job) can be requested in your script with:
-+#SBATCH -n [#cores for processes]--
or
-+#SBATCH -n 1 #SBATCH -c [#cores for threads of a single process]--
Both sbatch and salloc support -n on the command line, and it should always be set either in the script or on the command line, as the default is \(n=1\). Do not request more cores than you think will be useful, as larger-core jobs are more difficult to schedule. On the flip side, though, if you are going to be running a program that scales out to the maximum single-machine core count available, please (please) request 32 cores, to avoid node oversubscription (i.e., to avoid overloading the CPUs).
Important note: --ntasks or --ntasks-per-node (-n) refers to processes (usually the ones run with srun), while --cpus-per-task (-c) corresponds to threads per process. Some programs consider them equivalent, some don’t. Fluent, for example, uses --ntasks-per-node=8 and --cpus-per-task=1; others just set --cpus-per-task=8 and --ntasks-per-node=1. If one of them is not \(1\), then some applications need to be told to use \(n*c\) total cores.
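To make the distinction concrete, here is a hedged sketch (directive fragments only, not a complete script) of the two ways to request 8 cores:

    # MPI-style job: 8 single-threaded processes (the Fluent-like case)
    #SBATCH --ntasks-per-node=8
    #SBATCH --cpus-per-task=1

    # Threaded job: 1 process running 8 threads (the OpenMP-like case)
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=8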
Core count associated with a job appears under “AllocCPUS” in the sacct -j output.
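For example (the job ID is a placeholder), the field can be listed explicitly with sacct:

    # show the allocated core count and state of a finished job
    sacct -j 123456 --format=JobID,JobName,AllocCPUS,State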
-+[serguei@speed-submit src] % squeue -l Thu Oct 19 20:32:32 2023 JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON) @@ -919,58 +956,58 @@-
-
+
+
2.8 Interactive Jobs
Job sessions can be interactive, instead of batch (script) based. Such sessions can be useful for testing, debugging, and optimising code and resource requirements, for setting up conda or Python virtual environments, or for any similar preparatory work prior to batch submission.
+
2.8.1 Command Line
To request an interactive job session, use salloc [options], similarly to an sbatch command-line job, e.g.,
salloc -J interactive-test --mem=1G -p ps -n 8

Inside the allocated salloc session you can run shell commands as usual; it is recommended to use srun for the heavy compute steps inside salloc. If it is a quick, short job, just to compile something on a GPU node for example, you can use an interactive srun directly (note that srun cannot be run within srun), e.g., a 1 hour allocation:
For tcsh:
-+srun --pty -n 8 -p pg --gpus=1 --mem=1Gb -t 60 /encs/bin/tcsh--
For bash:
-+srun --pty -n 8 -p pg --gpus=1 --mem=1Gb -t 60 /encs/bin/bash--
+
+
2.8.2 Graphical Applications
If you need to run an on-Speed graphical UI application (e.g., MATLAB, Abaqus CAE, etc.), or an IDE (PyCharm, VSCode, Eclipse) to develop and test your job’s code interactively, you need to enable X11 forwarding from your client machine to Speed, and then to the compute node. To do so:
+
-
- -
you need to run an X server on your client machine, such as:
-
- on Windows: MobaXterm with X turned on, or Xming + PuTTY with X11 forwarding, or XOrg under Cygwin @@ -978,17 +1015,17 @@
on macOS: XQuartz – use its xterm and ssh -X
- on Linux just use ssh -X speed.encs.concordia.ca
See https://www.concordia.ca/ginacody/aits/support/faq/xserver.html for details.
- -
verify your X connection was properly forwarded by printing the DISPLAY variable:
echo $DISPLAY
If it has no output, then your X forwarding is not on and you may need to re-login to Speed.
- -
Use the --x11 option with salloc or srun:
salloc ... --x11=first ...
Once landed on a compute node, verify DISPLAY again.
- -
While running under the scheduler, create a run-user directory and set the variable XDG_RUNTIME_DIR:
-+mkdir -p /speed-scratch/$USER/run-dir setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir-+
- -
Launch your graphical application:
module load the required version, then matlab, or abaqus cae, etc.
Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local installation. You can make a similar install under your own directory. If using VSCode, it is currently only supported with the --no-sandbox option.
BASH version:
-+bash-3.2$ ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too) serguei@speed’s password: [serguei@speed-submit ~] % echo $DISPLAY @@ -1032,13 +1069,13 @@-
TCSH version:
-+ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too) [speed-submit] [/home/c/carlos] > echo $DISPLAY localhost:14.0 @@ -1053,7 +1090,7 @@+
-
-2.8.3 Jupyter Notebooks
+2.8.3 Jupyter Notebooks in Singularity
This is an example of running Jupyter notebooks together with Singularity (for more on Singularity, see Section 2.16). Here we are using one of the OpenISS-derived containers (see Section 2.15.4 as well).
+
@@ -1168,64 +1205,126 @@
- Use the --x11 with salloc or srun as described in the above example
- Load Singularity module module load singularity/3.10.4/default
- -
Execute this Singularity command on a single line. It’s best to save it in a shell script that you could call, since it’s long.
-+srun singularity exec -B $PWD\:/speed-pwd,/speed-scratch/$USER\:/my-speed-scratch,/nettemp \ --env SHELL=/bin/bash --nv /speed-scratch/nag-public/openiss-cuda-conda-jupyter.sif \ /bin/bash -c ’/opt/conda/bin/jupyter notebook --no-browser --notebook-dir=/speed-pwd \ --ip="*" --port=8888 --allow-root’-+
- -
Create an ssh tunnel between your computer and the node (speed-XX) where Jupyter is running, using speed-submit as a “jump server” (preferably with PuTTY, see Figure 5 and Figure 6):
-+ssh -L 8888:speed-XX:8888 YOUR_USER@speed-submit.encs.concordia.ca-Don’t close the tunnel. +
Don’t close the tunnel.
- -
Open a browser and copy your Jupyter token (see the screenshot example in Figure 7); each time the token will be different, as it is printed to you in the terminal.
-+http://localhost:8888/?token=5a52e6c0c7dfc111008a803e5303371ed0462d3d547ac3fb-+
- Work with your notebook.
-2.8.4 VScode
+2.8.4 Jupyter Labs in Conda and Pytorch
This is an example of Jupyter Labs running in a Conda environment, with PyTorch.
+
- -
Environment preparation: for the FIRST time:
+
- Go to your speed-scratch directory: cd /speed-scratch/$USER
- Create a Jupyter (name of your choice) directory: mkdir -p Jupyter
- Go to Jupyter: cd Jupyter
- Open an Interactive session: salloc --mem=50G --gpus=1 -ppg (or -ppt)
- Set environment variables, create the conda environment, and install JupyterLab and PyTorch:
++ module load anaconda3/2023.03/default + setenv TMPDIR /speed-scratch/$USER/tmp + setenv TMP /speed-scratch/$USER/tmp + setenv CONDA_PKGS_DIRS /speed-scratch/$USER/Jupyter/pkgs + conda create -p /speed-scratch/$USER/Jupyter/jupyter-env + conda activate /speed-scratch/$USER/Jupyter/jupyter-env + conda install -c conda-forge jupyterlab + pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 + exit ++- -
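Optionally, before the exit step above, you can quickly check from inside the activated environment that PyTorch sees the GPU (a one-line sanity check, assuming the installation above completed):

    # should print True on a GPU node with a working CUDA build of PyTorch
    python3 -c "import torch; print(torch.cuda.is_available())"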
Running VScode +
Running Jupyter Labs, from speed-submit: +
+
+- +
+Open an Interactive session: salloc --mem=50G --gpus=1 -ppg (or -ppt) + + + +
++ cd /speed-scratch/$USER/Jupyter + module load anaconda3/2023.03/default + setenv TMPDIR /speed-scratch/$USER/tmp + setenv TMP /speed-scratch/$USER/tmp + setenv CONDA_PKGS_DIRS /speed-scratch/$USER/Jupyter/pkgs + conda activate /speed-scratch/$USER/Jupyter/jupyter-env + jupyter lab --no-browser --notebook-dir=$PWD --ip="*" --port=8888 --port-retries=50 +++
- Verify which port the system has assigned to Jupyter: http://localhost:XXXX/lab?token= +
+- SSH Tunnel creation: similar to Jupyter in Singularity, see Section 2.8.3 (an example tunnel command is sketched right after this list)
+- Open a browser and type: localhost:XXXX (port assigned)
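To make the tunnel step concrete (speed-XX and XXXX are placeholders for the node and port that JupyterLab actually reports), the command from your workstation would look like:

    # forward the JupyterLab port from the compute node through speed-submit
    ssh -L XXXX:speed-XX:XXXX YOUR_USER@speed-submit.encs.concordia.ca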
+
+2.8.5 VScode
+This is an example of running VScode; it is similar to the Jupyter notebooks setup, but it does not use containers. This is the Web version; a local(workstation)-remote(speed-node) version also exists, but it is for advanced users only (no support, execute it at your own risk).
++
@@ -1234,18 +1333,18 @@- +
Environment preparation: for the FIRST time:
- Go to your speed-scratch directory: cd /speed-scratch/$USER
- Create a vscode directory: mkdir vscode
- Go to vscode: cd vscode
- Create home and projects: mkdir {home,projects}
- Create this directory: mkdir -p /speed-scratch/$USER/run-user

Running VScode
- Go to your vscode directory: cd /speed-scratch/$USER/vscode
- Open interactive session: salloc --mem=10Gb --constraint=el9
- Set environment variable: setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-user
-- -
Run VScode, change the port if needed. +
- +
-Run VScode, change the port if needed.
-+/speed-scratch/nag-public/code-server-4.22.1/bin/code-server --user-data-dir=$PWD\/projects \ --config=$PWD\/home/.config/code-server/config.yaml --bind-addr="0.0.0.0:8080" $PWD\/projects-+
- SSH Tunnel creation: similar to Jupyter, see Section 2.8.3 (an example tunnel command is sketched after this list)
-- Open a browser and type: localhost:8080 +
- Open a browser and type: localhost:8080
-- -
If the browser asks for password: +
- +
If the browser asks for password:
-+cat /speed-scratch/$USER/vscode/home/.config/code-server/config.yaml-+
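For the SSH tunnel step above (speed-XX is a placeholder for the node your salloc session landed on), a sketch of the command from your workstation:

    # forward code-server's port 8080 through speed-submit
    ssh -L 8080:speed-XX:8080 YOUR_USER@speed-submit.encs.concordia.ca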
Figure 8: VScode running on a Speed node

2.9 Scheduler Environment Variables
The scheduler presents a number of environment variables that can be used in your jobs. You can invoke env or printenv in your job to know what those are (most begin with the prefix SLURM). Some of the more useful ones are:
@@ -1265,48 +1364,48 @@
$SLURM_ARRAY_TASK_ID=for array jobs (see Section 2.6).
See a more complete list here:
Figure 9 shows a sample script using some of these.
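As a minimal sketch (this is not the manual's Figure 9; the variables shown are just a selection), a job script can simply print a few of them:

    #!/encs/bin/tcsh
    #SBATCH --job-name=env-demo
    #SBATCH --mem=1G
    # print a few scheduler-provided variables for this job
    echo "Job ID:     $SLURM_JOB_ID"
    echo "Submit dir: $SLURM_SUBMIT_DIR"
    echo "Node list:  $SLURM_JOB_NODELIST"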
2.10 SSH Keys For MPI
Some programs effect their parallel processing via MPI (which is a communication protocol). An example of such software is Fluent. MPI needs to have ‘passwordless login’ set up, which means SSH keys. In your NFS-mounted home directory: @@ -1324,7 +1423,7 @@
permissions by default).
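In case a sketch helps, a generic passwordless-key setup looks roughly like the following (key type and file names are just examples; follow the manual's own steps where they differ):

    # generate a key pair with an empty passphrase, in your NFS-mounted home directory
    ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
    # authorize the public key for logins to the compute nodes
    cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys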
-
2.11 Creating Virtual Environments
The following documentation is specific to the Speed HPC Facility at the Gina Cody School of Engineering and Computer Science. Virtual environments are typically instantiated via Conda or Python. Another option is Singularity, detailed in Section 2.16. Usually, virtual environments are created once
@@ -1333,7 +1432,7 @@
-
2.11.1 Anaconda
Request an interactive session in the queue you wish to submit your jobs to (e.g., salloc -p pg --gpus=1 for GPU jobs). Once your interactive session has started, create an anaconda environment in your speed-scratch directory by using the prefix option when executing conda create. For example,
@@ -1346,7 +1445,7 @@
2.11.1 -
+module load anaconda3/2023.03/default conda create --prefix /speed-scratch/a_user/myconda@@ -1354,13 +1453,13 @@2.11.1
Note: Without the prefix option, the conda create command creates the environment in a_user’s home directory by default.
-List Environments. To view your conda environments, type: conda info --envs
-+# conda environments: # base * /encs/pkg/anaconda3-2023.03/root @@ -1368,13 +1467,13 @@2.11.1
-
Activate an Environment. Activate the environment /speed-scratch/a_user/myconda as follows:
-+conda activate /speed-scratch/a_user/mycondaAfter activating your environment, add pip to your environment by using @@ -1382,7 +1481,7 @@
2.11.1 -
+conda install pipThis will install pip and pip’s dependencies, including python, into the environment. @@ -1394,7 +1493,7 @@
2.11.1 -
+salloc -p pg --gpus=1 --mem=10GB -A <slurm account name> cd /speed-scratch/$USER module load python/3.11.0/default @@ -1414,7 +1513,7 @@2.11.1 conda install installs modules from anaconda’s repository.
-
2.11.2 Python
Setting up a Python virtual environment is fairly straightforward. The first step is to request an interactive session in the queue you wish to submit your jobs to.
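As a generic sketch of the pattern (bash syntax; the path is a placeholder and the module version is the one used in the example below):

    # create and activate a Python virtual environment under speed-scratch
    module load python/3.9.1/default
    python3 -m venv /speed-scratch/$USER/myvenv
    source /speed-scratch/$USER/myvenv/bin/activate
    pip install --upgrade pip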
We have a simple example that use a Python virtual environment: @@ -1426,7 +1525,7 @@
2.11.2 -
+salloc -p pg --gpus=1 --mem=10GB -A <slurm account name> cd /speed-scratch/$USER module load python/3.9.1/default @@ -1446,51 +1545,51 @@2.11.2 --gpus= when preparing environments for CPU jobs.
-
2.12 Example Job Script: Fluent