From 88b4b752915e0ceebc95a70f1978bb3b26819796 Mon Sep 17 00:00:00 2001
From: Carlos Alarcon
Date: Mon, 29 Apr 2024 10:11:32 -0400
Subject: [PATCH] jupyter+conda+pytorch, no singularity
---
 doc/web/index.html | 693 ++++++++++++++++++++++++++-------------------
 1 file changed, 396 insertions(+), 297 deletions(-)

diff --git a/doc/web/index.html b/doc/web/index.html
index 99fe53e..d3c469e 100644
--- a/doc/web/index.html
+++ b/doc/web/index.html
@@ -67,54 +67,55 @@

Contents


 2.8 Interactive Jobs
  2.8.1 Command Line
  2.8.2 Graphical Applications -
  2.8.3 Jupyter Notebooks -
  2.8.4 VScode -
 2.9 Scheduler Environment Variables -
 2.10 SSH Keys For MPI -
 2.11 Creating Virtual Environments -
  2.11.1 Anaconda -
  2.11.2 Python -
 2.12 Example Job Script: Fluent -
 2.13 Example Job: efficientdet -
 2.14 Java Jobs -
 2.15 Scheduling On The GPU Nodes -
  2.15.1 P6 on Multi-GPU, Multi-Node -
  2.15.2 CUDA -
  2.15.3 Special Notes for sending CUDA jobs to the GPU Queue -
  2.15.4 OpenISS Examples -
 2.16 Singularity Containers -
3 Conclusion -
 3.1 Important Limitations -
 3.2 Tips/Tricks -
 3.3 Use Cases -
A History -
 A.1 Acknowledgments -
 A.2 Migration from UGE to SLURM -
 A.3 Phases -
  A.3.1 Phase 4 -
  A.3.2 Phase 3 -
  A.3.3 Phase 2 -
  A.3.4 Phase 1 -
B Frequently Asked Questions -
 B.1 Where do I learn about Linux? -
 B.2 How to use the “bash shell” on Speed? -
  B.2.1 How do I set bash as my login shell? -
  B.2.2 How do I move into a bash shell on Speed? -
  B.2.3 How do I use the bash shell in an interactive session on Speed? -
  B.2.4 How do I run scripts written in bash on Speed? -
 B.3 How to resolve “Disk quota exceeded” errors? -
  B.3.1 Probable Cause -
  B.3.2 Possible Solutions -
  B.3.3 Example of setting working directories for COMSOL -
  B.3.4 Example of setting working directories for Python Modules -
 B.4 How do I check my job’s status? -
 B.5 Why is my job pending when nodes are empty? -
  B.5.1 Disabled nodes -
  B.5.2 Error in job submit request. -
C Sister Facilities +
  2.8.3 Jupyter Notebooks in Singularity +
  2.8.4 Jupyter Labs in Conda and Pytorch +
  2.8.5 VScode +
 2.9 Scheduler Environment Variables +
 2.10 SSH Keys For MPI +
 2.11 Creating Virtual Environments +
  2.11.1 Anaconda +
  2.11.2 Python +
 2.12 Example Job Script: Fluent +
 2.13 Example Job: efficientdet +
 2.14 Java Jobs +
 2.15 Scheduling On The GPU Nodes +
  2.15.1 P6 on Multi-GPU, Multi-Node +
  2.15.2 CUDA +
  2.15.3 Special Notes for sending CUDA jobs to the GPU Queue +
  2.15.4 OpenISS Examples +
 2.16 Singularity Containers +
3 Conclusion +
 3.1 Important Limitations +
 3.2 Tips/Tricks +
 3.3 Use Cases +
A History +
 A.1 Acknowledgments +
 A.2 Migration from UGE to SLURM +
 A.3 Phases +
  A.3.1 Phase 4 +
  A.3.2 Phase 3 +
  A.3.3 Phase 2 +
  A.3.4 Phase 1 +
B Frequently Asked Questions +
 B.1 Where do I learn about Linux? +
 B.2 How to use the “bash shell” on Speed? +
  B.2.1 How do I set bash as my login shell? +
  B.2.2 How do I move into a bash shell on Speed? +
  B.2.3 How do I use the bash shell in an interactive session on Speed? +
  B.2.4 How do I run scripts written in bash on Speed? +
 B.3 How to resolve “Disk quota exceeded” errors? +
  B.3.1 Probable Cause +
  B.3.2 Possible Solutions +
  B.3.3 Example of setting working directories for COMSOL +
  B.3.4 Example of setting working directories for Python Modules +
 B.4 How do I check my job’s status? +
 B.5 Why is my job pending when nodes are empty? +
  B.5.1 Disabled nodes +
  B.5.2 Error in job submit request. +
C Sister Facilities
Annotated Bibliography

@@ -313,7 +314,7 @@

1.6

1.7 Requesting Access

After reviewing the “What Speed is” (Section 1.4) and “What Speed is Not” (Section 1.5), request
-access to the “Speed” cluster by emailing: rt-ex-hpc AT encs.concordia.ca. CGS ENCS
+access to the “Speed” cluster by emailing: rt-ex-hpc AT encs.concordia.ca. GCS ENCS
faculty and staff may request access directly. Students must include the following in their message:

@@ -327,7 +328,7 @@

1.7

Non-GCS faculty / students need to get a “sponsor” within GCS, such that your guest GCS ENCS account is created first. A sponsor can be any GCS Faculty member you collaborate with. Failing that, request the approval from our Dean’s Office; via our Associate Deans Drs. Eddie Hoi Ng or
-Emad Shihab. External entities to Concordia who collaborate with CGS Concordia researchers, should
+Emad Shihab. External entities to Concordia who collaborate with GCS Concordia researchers, should
also go through the Dean’s office for approvals. Non-GCS students taking a GCS course do have their GCS ENCS account created automatically, but still need the course instructor’s approval to use the service.
@@ -780,12 +781,48 @@

See man sacct or sacct -e for details of the available formatting options. You can define your preferred default format in the SACCT_FORMAT environment variable in your .cshrc or .bashrc files. +

+
  • +

    seff [job-ID]: reports on the efficiency of a job’s CPU and memory utilization. Don’t execute it on RUNNING jobs (only on completed/finished jobs); otherwise the efficiency statistics may be misleading. +

    If you define the following directives in your batch script, you will receive seff output in your + email when your job is finished. + + + +

    +
    +     #SBATCH --mail-type=ALL
    +     #SBATCH --mail-user=USER_NAME@encs.concordia.ca
    +     ## Replace USER_NAME with your encs username.
    +
    +

    +

    Output example: + + + +

    +
    +     Job ID: XXXXX
    +     Cluster: speed
    +     User/Group: user1/user1
    +     State: COMPLETED (exit code 0)
    +     Nodes: 1
    +     Cores per node: 4
    +     CPU Utilized: 00:04:29
    +     CPU Efficiency: 0.35% of 21:32:20 core-walltime
    +     Job Wall-clock time: 05:23:05
    +     Memory Utilized: 2.90 GB
    +     Memory Efficiency: 2.90% of 100.00 GB
    +
    +

  • -

    +

    2.5 Advanced sbatch Options

    -

    In addition to the basic sbatch options presented earlier, there are a few additional options that are +

    In addition to the basic sbatch options presented earlier, there are a few additional options that are generally useful:

      @@ -797,12 +834,12 @@

    • --mail-user email@domain.com: requests that the scheduler use this e-mail notification address, rather than the default (see, --mail-type). - - -
    • --export=[ALL | NONE | variables]: exports environment variable(s) that can be used by the script. + + +
    • -t [min] or DAYS-HH:MM:SS: sets a job runtime of min or HH:MM:SS. Note that if you give a single number, that represents minutes, not hours. @@ -810,35 +847,35 @@

    • --depend=[state:job-ID]: run this job only when job [job-ID] finishes. Held jobs appear in the queue.
    -

    The many sbatch options available are read with, man sbatch. Also note that sbatch options can +

    The many sbatch options available are read with, man sbatch. Also note that sbatch options can be specified during the job-submission command, and these override existing script options (if present). The syntax is, sbatch [options] PATHTOSCRIPT, but unlike in the script, the options are specified without the leading #SBATCH (e.g., sbatch -J sub-test --chdir=./ --mem=1G ./tcsh.sh). -
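As a quick sketch of how these options combine in practice (the job name, runtime, dependency job-ID, and program below are all placeholders, not taken from this manual; --dependency is the long form of the --depend option described above):

```shell
#!/encs/bin/tcsh

#SBATCH --job-name=opts-demo        ## Placeholder job name (-J)
#SBATCH --mail-type=ALL             ## E-mail on all job state changes
#SBATCH --mail-user=USER_NAME@encs.concordia.ca
#SBATCH --export=ALL                ## Export the submission environment to the job
#SBATCH -t 0-01:30:00               ## Runtime in DAYS-HH:MM:SS (here 1.5 hours)
#SBATCH --dependency=afterok:12345  ## Placeholder job-ID; start only after it succeeds

srun ./my_program                   ## Placeholder executable
```

Such a script is only meaningful under the scheduler; submit it with sbatch as shown earlier.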

    +

    2.6 Array Jobs

    -

    Array jobs are those that start a batch job or a parallel job multiple times. Each iteration of the job +

    Array jobs are those that start a batch job or a parallel job multiple times. Each iteration of the job array is called a task and receives a unique job ID. Only supported for batch jobs; submit time \(< 1\) second, compared to repeatedly submitting the same regular job over and over even from a script. -

    To submit an array job, use the --array option of the sbatch command as follows: +

    To submit an array job, use the --array option of the sbatch command as follows:

    -
    +   
 sbatch --array=n-m[:s] <batch_script>
     
    -

    -

    -t Option Syntax:

    +

    +

    --array Option Syntax:

    • n: indicates the start-id.
    • m: indicates the max-id.
    • s: indicates the step size.
    -

    Examples:

    +

    Examples:

    • sbatch --array=1-50000 -N1 -i my_in_%a -o my_out_%a array.sh: submits a job with 50000 elements, %a maps to the task-id between 1 and 50K. @@ -850,56 +887,56 @@

      2.6

    • sbatch --array=3-15:3 array.sh: submits a job with 5 tasks spaced with step size 3 (task-ids 3, 6, 9, 12, 15).
    -

    Output files for Array Jobs: -

    The default and output and error-files are slurm-job_id_task_id.out. This means that Speed +

    Output files for Array Jobs: +

The default output and error files are named slurm-job_id_task_id.out. This means that Speed creates an output and an error file for each task generated by the array job, as well as one for the super-ordinate array job. To alter this behaviour, use the -o and -e options of sbatch.

    For more details about Array Job options, please review the manual pages for sbatch by executing +

    For more details about Array Job options, please review the manual pages for sbatch by executing the following at the command line on speed-submit man sbatch. -

    +

    2.7 Requesting Multiple Cores (i.e., Multithreading Jobs)

    -

    For jobs that can take advantage of multiple machine cores, up to 32 cores (per job) can be requested +

    For jobs that can take advantage of multiple machine cores, up to 32 cores (per job) can be requested in your script with:

    -
    +   
     #SBATCH -n [#cores for processes]
     
    -

    -

    or +

    +

    or

    -
    +   
     #SBATCH -n 1
     #SBATCH -c [#cores for threads of a single process]
     
    -

    -

    Both sbatch and salloc support -n on the command line, and it should always be used either in +

    +

Both sbatch and salloc support -n on the command line, and it should always be set either in the script or on the command line, since the default is \(n=1\). Do not request more cores than you think will be useful, as larger-core jobs are more difficult to schedule. On the flip side, though, if you are going to be running a program that scales out to the maximum single-machine core count available, please (please) request 32 cores, to avoid node oversubscription (i.e., to avoid overloading the CPUs).

    Important note about --ntasks or --ntasks-per-node (-n) talks about processes (usually the +

Important note: --ntasks or --ntasks-per-node (-n) refers to processes (usually the ones run with srun), while --cpus-per-task (-c) corresponds to threads per process. Some programs consider them equivalent, some don’t. Fluent, for example, uses --ntasks-per-node=8 and --cpus-per-task=1; others just set --cpus-per-task=8 and --ntasks-per-node=1. If one of them is not \(1\), then some applications need to be told to use \(n*c\) total cores.
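To make the distinction concrete, here are two hedged request sketches that both allocate 8 cores on one node (n*c = 8 in each case):

```shell
## MPI-style: 8 single-threaded processes (the Fluent pattern above)
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=1

## Threaded-style: 1 process with 8 threads (e.g., an OpenMP program)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
```

Pick the form that matches how your program actually parallelizes: processes communicating via MPI, or threads within one process.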

    Core count associated with a job appears under, “AllocCPUS”, in the, qacct -j, output. +

The core count associated with a job appears under “AllocCPUS” in the sacct -j output.

    -
    +   
     [serguei@speed-submit src] % squeue -l
     Thu Oct 19 20:32:32 2023
     JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
    @@ -919,58 +956,58 @@ 

    -

    -

    +

    +

    2.8 Interactive Jobs

    -

    Job sessions can be interactive, instead of batch (script) based. Such sessions can be useful for testing, +

    Job sessions can be interactive, instead of batch (script) based. Such sessions can be useful for testing, debugging, and optimising code and resource requirements, conda or python virtual environments setup, or any likewise preparatory work prior to batch submission. -

    +

    2.8.1 Command Line
    -

    To request an interactive job session, use, salloc [options], similarly to a sbatch command-line +

    To request an interactive job session, use, salloc [options], similarly to a sbatch command-line job, e.g.,

    -
    +   
     salloc -J interactive-test --mem=1G -p ps -n 8
     
    -

    Inside the allocated salloc session you can run shell commands as usual; it is recommended to use +

Inside the allocated salloc session you can run shell commands as usual; it is recommended to use srun for the heavy compute steps inside salloc. For a quick, short job, e.g., just to compile something on a GPU node, you can use an interactive srun directly (note that srun cannot run within srun), e.g., a 1-hour allocation:

    For tcsh: +

    For tcsh:

    -
    +   
     srun --pty -n 8 -p pg --gpus=1 --mem=1Gb -t 60 /encs/bin/tcsh
     
    -

    -

    For bash: +

    +

    For bash:

    -
    +   
     srun --pty -n 8 -p pg --gpus=1 --mem=1Gb -t 60 /encs/bin/bash
     
    -

    -

    +

    +

    2.8.2 Graphical Applications
    -

    If you need to run an on-Speed graphical-based UI application (e.g., MALTLAB, Abaqus CME, etc.), +

If you need to run an on-Speed graphical UI application (e.g., MATLAB, Abaqus CAE, etc.), or an IDE (PyCharm, VSCode, Eclipse) to develop and test your job’s code interactively, you need to enable X11 forwarding from your client machine to Speed and then to the compute node. To do so:

    +

    1. -

      you need to run an X server on your client machine, such as,

      +

      you need to run an X server on your client machine, such as,

      • on Windows: MobaXterm with X turned on, or Xming + PuTTY with X11 forwarding, or XOrg under Cygwin @@ -978,17 +1015,17 @@
  on macOS: XQuartz – use its xterm and ssh -X
      • on Linux just use ssh -X speed.encs.concordia.ca
      -

      See https://www.concordia.ca/ginacody/aits/support/faq/xserver.html for +

      See https://www.concordia.ca/ginacody/aits/support/faq/xserver.html for details.

    2. -

      verify your X connection was properly forwarded by printing the DISPLAY variable: -

      echo $DISPLAY If it has no output, then your X forwarding is not on and you may need to +

      verify your X connection was properly forwarded by printing the DISPLAY variable: +

      echo $DISPLAY. If it has no output, then your X forwarding is not on and you may need to re-login to Speed.

    3. -

      Use the --x11 with salloc or srun: -

      salloc ... --x11=first ... +

      Use the --x11 option with salloc or srun: +

      salloc ... --x11=first ... @@ -996,30 +1033,30 @@

      Once landed on a compute node, verify DISPLAY again.
    4. -

      While running under scheduler, create a run-user directory and set the variable +

      While running under the scheduler, create a run-user directory and set the XDG_RUNTIME_DIR variable.

      -
      +     
            mkdir -p /speed-scratch/$USER/run-dir
            setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
       
      -

      +

    5. -

      Launch your graphical application: -

      module load the required version, then matlab, or abaqus cme, etc.

    -

    Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local +

    Launch your graphical application: +

    module load the required version, then run matlab, or abaqus cae, etc.

    +

    Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local installation. You can make a similar install under your own directory. If using VSCode, it’s currently only supported with the --no-sandbox option.
    -

    BASH version: +

    BASH version:

    -
    +   
     bash-3.2$ ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
     serguei@speed’s password:
     [serguei@speed-submit ~] % echo $DISPLAY
    @@ -1032,13 +1069,13 @@ 
    -

    TCSH version: +

    +

    TCSH version:

    -
    +   
     ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
     [speed-submit] [/home/c/carlos] > echo $DISPLAY
     localhost:14.0
    @@ -1053,7 +1090,7 @@ 
    +

    @@ -1064,62 +1101,62 @@
    PIC +

    PIC

    Figure 4: PyCharm Starting up on a Speed Node
    -
    2.8.3 Jupyter Notebooks
    -

    This is an example of running Jupyter notebooks together with Singularity (more on Singularity see +

    2.8.3 Jupyter Notebooks in Singularity
    +

    This is an example of running Jupyter notebooks together with Singularity (for more on Singularity, see Section 2.16). Here we are using one of the OpenISS-derived containers (see also Section 2.15.4). -

    +

    1. Use the --x11 option with salloc or srun, as described in the example above
    2. Load Singularity module module load singularity/3.10.4/default
    3. -

      Execute this Singularity command on a single line. It’s best to save it in a shell script that you +

      Execute this Singularity command on a single line. It’s best to save it in a shell script that you could call, since it’s long.

      -
      +     
            srun singularity exec -B $PWD\:/speed-pwd,/speed-scratch/$USER\:/my-speed-scratch,/nettemp \
             --env SHELL=/bin/bash --nv /speed-scratch/nag-public/openiss-cuda-conda-jupyter.sif \
             /bin/bash -c ’/opt/conda/bin/jupyter notebook --no-browser --notebook-dir=/speed-pwd \
             --ip="*" --port=8888 --allow-root’
       
      -

      +

    4. -

      Create an ssh tunnel between your computer and the node (speed-XX) where Jupyter is +

      Create an SSH tunnel between your computer and the node (speed-XX) where Jupyter is running, using speed-submit as a “jump server” (preferably with PuTTY; see Figure 5 and Figure 6)

      -
      +     
            ssh -L 8888:speed-XX:8888 YOUR_USER@speed-submit.encs.concordia.ca
       
      -

      Don’t close the tunnel. +

      Don’t close the tunnel.

    5. -

      Open a browser, and copy your Jupyter’s token, in the screenshot example in Figure 7; each +

      Open a browser and paste in your Jupyter token; see the screenshot example in Figure 7. The token will be different each time, as it is printed to you in the terminal.

      -
      +     
            http://localhost:8888/?token=5a52e6c0c7dfc111008a803e5303371ed0462d3d547ac3fb
       
      -

      +

    6. Work with your notebook.
    @@ -1168,64 +1205,126 @@
    2.8 -
    2.8.4 VScode
    -

    This is an example of running VScode, it’s similar to Jupyter notebooks, but it doesn’t use containers. -This a Web version, it exists the local(workstation)-remote(speed-node) version too, but it is for -Advanced users (no support, execute it at your own risk). +

    2.8.4 Jupyter Labs in Conda and Pytorch
    +

    This is an example of Jupyter Labs running in a Conda environment, with Pytorch.

    • -

      Environment preparation: for the FIRST time: +

      Environment preparation: for the FIRST time:

      1. Go to your speed-scratch directory: cd /speed-scratch/$USER
-     2. Create a vscode directory: mkdir vscode
+     2. Create a Jupyter (name of your choice) directory: mkdir -p Jupyter
-     3. Go to vscode: cd vscode
+     3. Go to Jupyter: cd Jupyter
-     4. Create home and projects: mkdir {home,projects}
+     4. Open an Interactive session: salloc --mem=50G --gpus=1 -ppg (or -ppt)
+     5. Create this directory: mkdir -p /speed-scratch/$USER/run-user
      +
    • +

      Set environment variables, create the Conda environment, and install Jupyter and Pytorch:

      +
      +         module load anaconda3/2023.03/default
      +         setenv TMPDIR /speed-scratch/$USER/tmp
      +         setenv TMP /speed-scratch/$USER/tmp
      +         setenv CONDA_PKGS_DIRS /speed-scratch/$USER/Jupyter/pkgs
      +         conda create -p /speed-scratch/$USER/Jupyter/jupyter-env
      +         conda activate /speed-scratch/$USER/Jupyter/jupyter-env
      +         conda install -c conda-forge jupyterlab
      +         pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
      +         exit
      +
      +

    • -

      Running VScode +

      Running Jupyter Labs, from speed-submit: +

        +
      1. +

        Open an Interactive session: salloc --mem=50G --gpus=1 -ppg (or -ppt) + + + +

        +
        +         cd /speed-scratch/$USER/Jupyter
        +         module load anaconda3/2023.03/default
        +         setenv TMPDIR /speed-scratch/$USER/tmp
        +         setenv TMP /speed-scratch/$USER/tmp
        +         setenv CONDA_PKGS_DIRS /speed-scratch/$USER/Jupyter/pkgs
        +         conda activate /speed-scratch/$USER/Jupyter/jupyter-env
        +         jupyter lab --no-browser --notebook-dir=$PWD --ip="*" --port=8888 --port-retries=50
        +
        +

        +

+     2. Verify which port the system has assigned to Jupyter: http://localhost:XXXX/lab?token=
+     3. SSH tunnel creation: similar to Jupyter in Singularity, see Section 2.8.3
+     4. Open a browser and type: localhost:XXXX (port assigned)
      +
    +

    +

    +
    2.8.5 VScode
    +

    This is an example of running VScode; it is similar to Jupyter notebooks, but it does not use containers. This is the Web version; a local(workstation)-remote(speed-node) version exists too, but it is for advanced users (no support, execute it at your own risk). +

    +
      +
    • +

      Environment preparation: for the FIRST time:

        -
-     1. Go to your vscode directory: cd /speed-scratch/$USER/vscode
+     1. Go to your speed-scratch directory: cd /speed-scratch/$USER
-     2. Open interactive session: salloc --mem=10Gb --constraint=el9
+     2. Create a vscode directory: mkdir vscode
-     3. Set environment variable: setenv XDG_RUNTIME_DIR
+     3. Go to vscode: cd vscode
+     4. Create home and projects: mkdir {home,projects}
+     5. Create this directory: mkdir -p /speed-scratch/$USER/run-user
      +
    • +
    • +

      Running VScode +

        +
+     1. Go to your vscode directory: cd /speed-scratch/$USER/vscode
+     2. Open interactive session: salloc --mem=10Gb --constraint=el9
+     3. Set environment variable: setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-user
+     4. Run VScode, change the port if needed.

        -
        +         
                  /speed-scratch/nag-public/code-server-4.22.1/bin/code-server --user-data-dir=$PWD\/projects \
                  --config=$PWD\/home/.config/code-server/config.yaml --bind-addr="0.0.0.0:8080" $PWD\/projects
         
        -

        +

+     5. SSH tunnel creation: similar to Jupyter, see Section 2.8.3
+     6. Open a browser and type: localhost:8080
+     7. If the browser asks for password:

        -
        +         
                  cat /speed-scratch/$USER/vscode/home/.config/code-server/config.yaml
         
        -

        +

    @@ -1234,18 +1333,18 @@
    2.8.4 +
    PIC
    -
    Figure 8: VScode running on a Speed node
    +
    Figure 8: VScode running on a Speed node
    -

    2.9 Scheduler Environment Variables

    -

    The scheduler presents a number of environment variables that can be used in your jobs. You can +

    2.9 Scheduler Environment Variables

    +

    The scheduler presents a number of environment variables that can be used in your jobs. You can invoke env or printenv in your job to see what those are (most begin with the prefix SLURM). Some of the more useful ones are:

    @@ -1265,48 +1364,48 @@

    $SLURM_ARRAY_TASK_ID=for array jobs (see Section 2.6).
  • -

    See a more complete list here: +

    See a more complete list here:

  • -

    In Figure 9 is a sample script, using some of these. +

    In Figure 9 is a sample script, using some of these.

    - + -
    #!/encs/bin/tcsh 
    - 
    -#SBATCH --job-name=tmpdir      ## Give the job a name 
    -#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    -#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
    -#SBATCH --chdir=./             ## Use currect directory as working directory 
    -#SBATCH --nodes=1 
    -#SBATCH --ntasks=1 
    -#SBATCH --cpus-per-task=8      ## Request 8 cores 
    -#SBATCH --mem=32G              ## Assign 32G memory per node 
    - 
    -cd $TMPDIR 
    -mkdir input 
    -rsync -av $SLURM_SUBMIT_DIR/references/ input/ 
    -mkdir results 
    -srun STAR --inFiles $TMPDIR/input --parallel $SRUN_CPUS_PER_TASK --outFiles $TMPDIR/results 
    -rsync -av $TMPDIR/results/ $SLURM_SUBMIT_DIR/processed/
    +
    #!/encs/bin/tcsh 
    + 
    +#SBATCH --job-name=tmpdir      ## Give the job a name 
    +#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    +#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
+#SBATCH --chdir=./             ## Use current directory as working directory 
    +#SBATCH --nodes=1 
    +#SBATCH --ntasks=1 
    +#SBATCH --cpus-per-task=8      ## Request 8 cores 
    +#SBATCH --mem=32G              ## Assign 32G memory per node 
    + 
    +cd $TMPDIR 
    +mkdir input 
    +rsync -av $SLURM_SUBMIT_DIR/references/ input/ 
    +mkdir results 
    +srun STAR --inFiles $TMPDIR/input --parallel $SRUN_CPUS_PER_TASK --outFiles $TMPDIR/results 
    +rsync -av $TMPDIR/results/ $SLURM_SUBMIT_DIR/processed/
     
    -
    Figure 9: Source code for tmpdir.sh
    +
    Figure 9: Source code for tmpdir.sh
    -

    2.10 SSH Keys For MPI

    +

    2.10 SSH Keys For MPI

    Some programs effect their parallel processing via MPI (which is a communication protocol). An example of such software is Fluent. MPI needs to have ‘passwordless login’ set up, which means SSH keys. In your NFS-mounted home directory: @@ -1324,7 +1423,7 @@

    2.10 permissions by default).
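The key setup itself can be sketched as follows (run once in your NFS-mounted home directory; the ed25519 key type and empty passphrase are assumptions, not prescribed by this manual; adjust to your local policy):

```shell
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""     # empty passphrase => passwordless login
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys  # authorize your own key
chmod 700 ~/.ssh                                     # sshd refuses keys with loose permissions
chmod 600 ~/.ssh/authorized_keys
```

Because the home directory is NFS-mounted on every node, authorizing your own key once lets MPI ranks ssh between compute nodes without a password.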

    -

    2.11 Creating Virtual Environments

    +

    2.11 Creating Virtual Environments

    The following documentation is specific to the Speed HPC Facility at the Gina Cody School of Engineering and Computer Science. Virtual environments are typically instantiated via Conda or Python. Another option is Singularity, detailed in Section 2.16. Usually, virtual environments are created once

    -
    2.11.1 Anaconda
    +
    2.11.1 Anaconda

    Request an interactive session in the queue you wish to submit your jobs to (e.g., salloc -p pg --gpus=1 for GPU jobs). Once your interactive session has started, create an Anaconda environment in your speed-scratch directory by using the --prefix option when executing conda create. For example,

    2.11.1 -
    +   
     module load anaconda3/2023.03/default
     conda create --prefix /speed-scratch/a_user/myconda
     
    @@ -1354,13 +1453,13 @@
    2.11.1

    Note: Without the prefix option, the conda create command creates the environment in a_user’s home directory by default.

    -

    List Environments. +

    List Environments. To view your conda environments, type: conda info --envs

    -
    +   
     # conda environments:
     #
     base                  *  /encs/pkg/anaconda3-2023.03/root
    @@ -1368,13 +1467,13 @@ 
    2.11.1

    -

    Activate an Environment. +

    Activate an Environment. Activate the environment /speed-scratch/a_user/myconda as follows:

    -
    +   
     conda activate /speed-scratch/a_user/myconda
     

    After activating your environment, add pip to your environment by using @@ -1382,7 +1481,7 @@

    2.11.1 -
    +   
     conda install pip
     

    This will install pip and pip’s dependencies, including python, into the environment. @@ -1394,7 +1493,7 @@

    2.11.1 -
    +     
          salloc -p pg --gpus=1 --mem=10GB -A <slurm account name>
          cd /speed-scratch/$USER
          module load python/3.11.0/default
    @@ -1414,7 +1513,7 @@ 
    2.11.1 conda install installs modules from anaconda’s repository.

    -
    2.11.2 Python
    +
    2.11.2 Python

    Setting up a Python virtual environment is fairly straightforward. The first step is to request an interactive session in the queue you wish to submit your jobs to.

    We have a simple example that uses a Python virtual environment:

    2.11.2 -
    +     
          salloc -p pg --gpus=1 --mem=10GB -A <slurm account name>
          cd /speed-scratch/$USER
          module load python/3.9.1/default
    @@ -1446,51 +1545,51 @@ 
    2.11.2 --gpus= when preparing environments for CPU jobs.

    -

    2.12 Example Job Script: Fluent

    +

    2.12 Example Job Script: Fluent

    - + -
    #!/encs/bin/tcsh 
    - 
    -#SBATCH --job-name=flu10000    ## Give the job a name 
    -#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    -#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
    -#SBATCH --chdir=./             ## Use currect directory as working directory 
    -#SBATCH --nodes=1              ## Number of nodes to run on 
    -#SBATCH --ntasks-per-node=32   ## Number of cores 
    -#SBATCH --cpus-per-task=1      ## Number of MPI threads 
    -#SBATCH --mem=160G             ## Assign 160G memory per node 
    - 
    -date 
    - 
    -module avail ansys 
    - 
    -module load ansys/19.2/default 
    -cd $TMPDIR 
    - 
    -set FLUENTNODES = "‘scontrol␣show␣hostnames‘" 
    -set FLUENTNODES = ‘echo $FLUENTNODES | tr ’ ’ ’,’‘ 
    - 
    -date 
    - 
    -srun fluent 3ddp \ 
    -        -g -t$SLURM_NTASKS \ 
    -        -g-cnf=$FLUENTNODES \ 
    -        -i $SLURM_SUBMIT_DIR/fluentdata/info.jou > call.txt 
    - 
    -date 
    - 
    -srun rsync -av $TMPDIR/ $SLURM_SUBMIT_DIR/fluentparallel/ 
    - 
    -date
    +
    #!/encs/bin/tcsh 
    + 
    +#SBATCH --job-name=flu10000    ## Give the job a name 
    +#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    +#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
+#SBATCH --chdir=./             ## Use current directory as working directory 
    +#SBATCH --nodes=1              ## Number of nodes to run on 
    +#SBATCH --ntasks-per-node=32   ## Number of cores 
    +#SBATCH --cpus-per-task=1      ## Number of MPI threads 
    +#SBATCH --mem=160G             ## Assign 160G memory per node 
    + 
    +date 
    + 
    +module avail ansys 
    + 
    +module load ansys/19.2/default 
    +cd $TMPDIR 
    + 
+set FLUENTNODES = "`scontrol show hostnames`" 
+set FLUENTNODES = `echo $FLUENTNODES | tr ' ' ','` 
    + 
    +date 
    + 
    +srun fluent 3ddp \ 
    +        -g -t$SLURM_NTASKS \ 
    +        -g-cnf=$FLUENTNODES \ 
    +        -i $SLURM_SUBMIT_DIR/fluentdata/info.jou > call.txt 
    + 
    +date 
    + 
    +srun rsync -av $TMPDIR/ $SLURM_SUBMIT_DIR/fluentparallel/ 
    + 
    +date
     
    -
    Figure 10: Source code for fluent.sh
    +
    Figure 10: Source code for fluent.sh
    @@ -1505,7 +1604,7 @@

    Caveat: take care with journal file paths.

    -

    2.13 Example Job: efficientdet

    +

    2.13 Example Job: efficientdet

    The following steps, describing how to create an efficientdet environment on Speed, were submitted by a member of Dr. Amer’s research group.

    @@ -1526,7 +1625,7 @@

    +
     pip install tensorflow==2.7.0
     pip install lxml>=4.6.1
     pip install absl-py>=0.10.0
    @@ -1545,7 +1644,7 @@ 

    -

    2.14 Java Jobs

    +

    2.14 Java Jobs

    Jobs that call java have a memory overhead, which needs to be taken into account when assigning a value to --mem. Even the most basic java call, java -Xmx1G -version, will need --mem=5G, with the 4 GB difference representing the memory overhead. Note that this memory
@@ -1554,7 +1653,7 @@

    2.14 314G.
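A hedged job-script sketch that applies this overhead rule (the module name and main class are placeholders; check module avail for the actual Java module on Speed):

```shell
#!/encs/bin/tcsh

#SBATCH --job-name=java-demo   ## Placeholder name
#SBATCH --mem=5G               ## 1G heap (-Xmx1G) + ~4G JVM overhead

module load java               ## Placeholder; verify with: module avail java
srun java -Xmx1G MyMainClass   ## Placeholder main class
```

The rule of thumb: whatever -Xmx heap you give the JVM, request roughly 4 GB more from the scheduler.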

    -

    2.15 Scheduling On The GPU Nodes

    +

    2.15 Scheduling On The GPU Nodes

    The primary cluster has two GPU nodes, each with six Tesla (CUDA-compatible) P6 cards: each card has 2048 cores and 16 GB of RAM. Note, though, that the P6 is mainly a single-precision card; unless you specifically need GPU double precision, double-precision calculations will be faster on a CPU

    +
     #SBATCH --gpus=[1|2]
     

    @@ -1575,7 +1674,7 @@

    +
     sbatch -p pg ./<myscript>.sh
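Combining the options above, a minimal GPU batch script might be sketched as follows (the job name, memory value, and time are illustrative, not prescribed by the manual):

```shell
#!/encs/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=pg    ## GPU partition, same as "sbatch -p pg"
#SBATCH --gpus=1
#SBATCH --mem=4G

## Confirm the allocated GPU is visible to the job
srun nvidia-smi
```

Saved as gpu-test.sh, it would be submitted with sbatch ./gpu-test.sh.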
     

    @@ -1584,7 +1683,7 @@

    +
     ssh <username>@speed[-05|-17|37-43] nvidia-smi
     

    @@ -1593,7 +1692,7 @@

    +
     sinfo -p pg --long --Node
     

    @@ -1609,7 +1708,7 @@

    +
     [serguei@speed-submit src] % sinfo -p pg --long --Node
     Thu Oct 19 22:31:04 2023
     NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
    @@ -1637,7 +1736,7 @@ 

    +
     [serguei@speed-submit src] % squeue -p pg -o "%15N %.6D %7P %.11T %.4c %.8z %.6m %.8d %.6w %.8f %20G %20E"
     NODELIST         NODES PARTITI       STATE MIN_    S:C:T MIN_ME MIN_TMP_  WCKEY FEATURES GROUP DEPENDENCY
     speed-05             1 pg          RUNNING    1    *:*:*     1G        0 (null)   (null) 11929     (null)
    @@ -1650,7 +1749,7 @@ 

    -
    2.15.1 P6 on Multi-GPU, Multi-Node
    +
    2.15.1 P6 on Multi-GPU, Multi-Node

    As described above, P6 cards are not compatible with the Distribute and DataParallel functions (PyTorch, TensorFlow) when running on multiple GPUs. One workaround is to run the job multi-node, with a single GPU per node; for example:

    +
     #SBATCH --nodes=2
     #SBATCH --gpus-per-node=1
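A fuller sketch of this multi-node, one-GPU-per-node workaround (train.py is a hypothetical placeholder for your PyTorch/TensorFlow entry point):

```shell
#!/encs/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=1
#SBATCH --ntasks-per-node=1   ## one task per node, each seeing its own P6

## Placeholder training script; replace with your own entry point
srun python train.py
```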
     
    @@ -1668,7 +1767,7 @@

    -
    2.15.2 CUDA
    +
    2.15.2 CUDA

    When calling CUDA within job scripts, it is important to create a link to the desired CUDA libraries and set the runtime link path to the same libraries. For example, to use the cuda-11.5 libraries, specify the following in your Makefile. @@ -1676,7 +1775,7 @@


    -
    +   
     -L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64
     

    @@ -1684,14 +1783,14 @@

    In addition, load a compatible compiler: module load gcc/8.4 or module load gcc/9.3.

    -
    2.15.3 Special Notes for sending CUDA jobs to the GPU Queue
    +
    2.15.3 Special Notes for sending CUDA jobs to the GPU Queue

    Interactive jobs (Section 2.8) must be submitted to the GPU partition in order to compile and link. We have several versions of CUDA installed in:

    -
    +   
     /encs/pkg/cuda-11.5/root/
     /encs/pkg/cuda-10.2/root/
     /encs/pkg/cuda-9.2/root
    @@ -1701,15 +1800,15 @@ 
    In your Makefile, replace /usr/local/cuda with one of the above.

    -
    2.15.4 OpenISS Examples
    +
    2.15.4 OpenISS Examples

    These are more comprehensive, research-like examples of jobs for computer vision and other tasks, with much longer runtimes (subject to the number of epochs and other parameters), derived from the actual research work of students and their theses. These jobs require the use of CUDA and GPUs. The examples are available both as “native” jobs on Speed and as Singularity containers.

    -

    OpenISS and REID

    The example openiss-reid-speed.sh illustrates a job for computer-vision-based person re-identification (e.g., motion-capture-based tracking for stage performance), part of the OpenISS project by Haotao Lai [10], using TensorFlow and Keras. The fork of the original repo [12] was adjusted to run on Speed.

    OpenISS and YOLOv3

    The related code, using the YOLOv3 framework, is in the fork of the original repo [11]; adjusted to run on Speed, it is here:

    @@ -1746,7 +1845,7 @@
  • https://github.com/NAG-DevOps/speed-hpc/tree/master/src#openiss-yolov3
  • -

    2.16 Singularity Containers

    +

    2.16 Singularity Containers

    If the /encs software tree does not have a required software package readily available, another option is to run Singularity containers. We run the EL7 flavor of Linux; if some projects require Ubuntu or other distributions, it is possible to run that software as a container, including the ones


    -
    +   
     /speed-scratch/nag-public:
     
     openiss-cuda-conda-jupyter.sif
    @@ -1820,7 +1919,7 @@ 


    -
    +   
     salloc --gpus=1 -n8 --mem=4Gb -t60
     cd /speed-scratch/$USER/
     singularity pull openiss-cuda-devicequery.sif docker://openiss/openiss-cuda-devicequery
    @@ -1833,14 +1932,14 @@ 

    example.

    -

    3 Conclusion

    +

    3 Conclusion

    The cluster is “first come, first served” until it fills; then job position in the queue is based upon past usage. The scheduler does attempt to fill gaps, though, so sometimes a single-core job of lower priority will schedule before a multi-core job of higher priority, for example.

    -

    3.1 Important Limitations

    +

    3.1 Important Limitations

    • New users are restricted to a total of 32 cores: write to rt-ex-hpc@encs.concordia.ca if you temporarily need more (192 is the maximum, i.e., 6 jobs of 32 cores each).


    -

    3.2 Tips/Tricks

    +

    3.2 Tips/Tricks

    • Files/scripts must have Linux line breaks in them (not Windows ones). Use the file command to verify, and the dos2unix command to convert.

    • E-mail, rt-ex-hpc AT encs.concordia.ca, with any concerns/questions.

    -

    3.3 Use Cases

    +

    3.3 Use Cases

    • HPC Committee’s initial batch about 6 students (end of 2019):

      @@ -1973,10 +2072,10 @@


    -

    A History

    +

    A History

    -

    A.1 Acknowledgments

    +

    A.1 Acknowledgments

    • The first 6 (to 6.5) versions of this manual and early UGE job script samples, Singularity testing and user support were produced/done by Dr. Scott Bunnell during his time at @@ -1986,14 +2085,14 @@

    • Dr. Tariq Daradkeh was our IT Instructional Specialist from August 2022 to September 2023, working on the scheduler, scheduling research, end-user support, and integration of examples, such as YOLOv3 in Section 2.15.4, among other tasks. We have a continued collaboration on HPC/scheduling research.

    -

    A.2 Migration from UGE to SLURM

    +

    A.2 Migration from UGE to SLURM

    For long term users who started off with Grid Engine here are some resources to make a transition and mapping to the job submission process.

    @@ -2005,7 +2104,7 @@

    +
          GE  => SLURM
          s.q    ps
          g.q    pg
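Beyond the queue names, the most common command translations (covered in detail by the Rosetta mapping figure and the links above) include:

```shell
## UGE / Grid Engine      =>  SLURM equivalent
## qsub script.sh         =>  sbatch script.sh
## qstat                  =>  squeue
## qdel <jobid>           =>  scancel <jobid>
## qhost                  =>  sinfo
```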
    @@ -2019,8 +2118,8 @@ 

    https://docs.alliancecan.ca/wiki/Running_jobs
    https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf
    https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm

    Figure 11: Rosetta Mappings of Scheduler Commands from SchedMD
  • @@ -2032,7 +2131,7 @@

    +
          # Speed environment set up
          if ($HOSTNAME == speed-submit.encs.concordia.ca) then
             source /local/pkg/uge-8.6.3/root/default/common/settings.csh
    @@ -2044,7 +2143,7 @@ 

    +
          # Speed environment set up
          if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
              . /local/pkg/uge-8.6.3/root/default/common/settings.sh
    @@ -2058,21 +2157,21 @@ 

    -

    A.3 Phases

    +

    A.3 Phases

    Brief summary of Speed evolution phases.

    -
    A.3.1 Phase 4
    +
    A.3.1 Phase 4

    Phase 4 added 7 SuperMicro servers with 4x A100 80GB GPUs each, dubbed “SPEED2”. We also moved from Grid Engine to SLURM.

    -
    A.3.2 Phase 3
    +
    A.3.2 Phase 3

    Phase 3 added 4 vidpro nodes from Dr. Amer, totalling 6x P6 and 6x V100 GPUs.

    -
    A.3.3 Phase 2
    +
    A.3.3 Phase 2

    Phase 2 saw 6x NVIDIA Tesla P6 cards and 8x more compute nodes added. The P6s replaced 4x FirePro S7150 cards.


    -
    A.3.4 Phase 1
    +
    A.3.4 Phase 1

    Phase 1 of Speed was of the following configuration:

      @@ -2091,20 +2190,20 @@

    -

    B Frequently Asked Questions

    +

    B Frequently Asked Questions

    -

    B.1 Where do I learn about Linux?

    +

    B.1 Where do I learn about Linux?

    All Speed users are expected to have a basic understanding of Linux and its commonly used commands.

    -
    Software Carpentry
    +
    Software Carpentry

    Software Carpentry provides free resources to learn software, including a workshop on the Unix shell. https://software-carpentry.org/lessons/

    -
    Udemy
    +
    Udemy

    There are a number of Udemy courses, including free ones, that will assist you in learning Linux. Active Concordia faculty, staff and students have access to Udemy courses. The course Linux Mastery: Master the Linux Command Line in 11.5 Hours is a good starting point.


    -

    B.2 How to use the “bash shell” on Speed?

    +

    B.2 How to use the “bash shell” on Speed?

    This section describes how to use the “bash shell” on Speed. Review Section 2.1.2 to ensure that your bash environment is set up.

    -
    B.2.1 How do I set bash as my login shell?
    +
    B.2.1 How do I set bash as my login shell?

    In order to set your default login shell to bash on Speed, your login shell on all GCS servers must be changed to bash. To make this change, create a ticket with the Service Desk (or email help at concordia.ca) to request that bash become your default login shell for your ENCS user account on all GCS servers.

    -
    B.2.2 How do I move into a bash shell on Speed?
    +
    B.2.2 How do I move into a bash shell on Speed?

    To move to the bash shell, type bash at the command prompt. For example:

    -
    +   
     [speed-submit] [/home/a/a_user] > bash
     bash-4.4$ echo $0
     bash
    @@ -2143,7 +2242,7 @@ 
    The prompt becomes bash-4.4$ after entering the bash shell.

    -
    B.2.3 How do I use the bash shell in an interactive session on Speed?
    +
    B.2.3 How do I use the bash shell in an interactive session on Speed?

    Below are examples of how to use bash as a shell in your interactive job sessions with both the salloc and srun commands.

    @@ -2153,41 +2252,41 @@
    srun --mem=50G -n 5 --pty /encs/bin/bash

  • Note: Make sure the interactive job requests memory, cores, etc.

    -
    B.2.4 How do I run scripts written in bash on Speed?
    +
    B.2.4 How do I run scripts written in bash on Speed?

    To execute bash scripts on Speed:

    1. Ensure that the shebang of your bash job script is #!/encs/bin/bash
    2. Use the sbatch command to submit your job script to the scheduler.

    The Speed GitHub contains a sample bash job script.
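A minimal bash job script following these two steps might look like this (the job name, memory value, and echo line are illustrative, not from the sample on GitHub):

```shell
#!/encs/bin/bash
#SBATCH --job-name=bash-test
#SBATCH --mem=1G

## $0 reports the script being run; hostname shows the allocated node
echo "Running $0 on $(hostname)"
```

Saved as bash-test.sh, it would be submitted with sbatch ./bash-test.sh.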

    -

    B.3 How to resolve “Disk quota exceeded” errors?

    +

    B.3 How to resolve “Disk quota exceeded” errors?

    -
    B.3.1 Probable Cause
    +
    B.3.1 Probable Cause

    The “Disk quota exceeded” Error occurs when your application has run out of disk space to write to. On Speed this error can be returned when:

    1. Your NFS-provided home is full and cannot be written to. You can verify this using the quota and bigfiles commands.
    2. The /tmp directory on the speed node your application is running on is full and cannot be written to.

    -
    B.3.2 Possible Solutions
    +
    B.3.2 Possible Solutions

    1. Use the --chdir job script option to set the job working directory. The job working directory is the directory that the job will write output files in.
    2. The use of local disk space is generally recommended for IO-intensive operations. However, as the size of /tmp on speed nodes is 1TB, it can be necessary for scripts to store temporary data elsewhere. Review the documentation for each module called within your script to determine how to set working directories for that application.


      -
      +         
                mkdir -m 750 /speed-scratch/$USER/output
                 
       
      @@ -2220,7 +2319,7 @@

      -
      +         
                mkdir -m 750 /speed-scratch/$USER/recovery
       

      @@ -2231,7 +2330,7 @@


      In the above example, $USER is an environment variable containing your ENCS username.

      -
      B.3.3 Example of setting working directories for COMSOL
      +
      B.3.3 Example of setting working directories for COMSOL
    • Create directories for recovery, temporary, and configuration files. For example, to create these directories:

        +
              mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
         

        @@ -2252,7 +2351,7 @@

        +
              -recoverydir /speed-scratch/$USER/comsol/recovery
              -tmpdir /speed-scratch/$USER/comsol/tmp
          -configuration /speed-scratch/$USER/comsol/config
        @@ -2261,7 +2360,7 @@ 
        In the above example, $USER is an environment variable containing your ENCS username.

        -
        B.3.4 Example of setting working directories for Python Modules
        +
        B.3.4 Example of setting working directories for Python Modules

        By default, when adding a Python module, the /tmp directory is set as the temporary repository for file downloads. The size of the /tmp directory on speed-submit is too small for PyTorch. To add a Python module:

        @@ -2272,7 +2371,7 @@
        +
                mkdir /speed-scratch/$USER/tmp
         

        @@ -2283,7 +2382,7 @@

        +
                setenv TMPDIR /speed-scratch/$USER/tmp
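The setenv line above is csh/tcsh syntax; if your login shell is bash (see Appendix B.2), the equivalent would be:

```shell
## bash equivalent of: setenv TMPDIR /speed-scratch/$USER/tmp
export TMPDIR=/speed-scratch/$USER/tmp
```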
         

        @@ -2292,17 +2391,17 @@

        In the above example, $USER is an environment variable containing your ENCS username.

        -

        B.4 How do I check my job’s status?

        +

        B.4 How do I check my job’s status?

        When a job with a job id of 1234 is running or has terminated, its status can be tracked using ‘sacct -j 1234’. ‘squeue -j 1234’ can show the job while it is sitting in the queue as well. Long-term statistics on the job after it has terminated can be found using ‘sstat -j 1234’, after slurmctld purges its tracking state into the database.
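In summary, for a job id of 1234, the status commands mentioned above are:

```shell
squeue -j 1234   ## while the job is sitting in the queue or running
sacct  -j 1234   ## status of a running or terminated job
sstat  -j 1234   ## longer-term statistics after termination
```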

        -

        B.5 Why is my job pending when nodes are empty?

        +

        B.5 Why is my job pending when nodes are empty?

        -
        B.5.1 Disabled nodes
        +
        B.5.1 Disabled nodes

        It is possible that one or a number of the Speed nodes are disabled. Nodes are disabled if they require maintenance. To verify whether Speed nodes are disabled, see if they are in a draining or drained state:


        -
        +   
         [serguei@speed-submit src] % sinfo --long --Node
         Thu Oct 19 21:25:12 2023
         NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
        @@ -2359,7 +2458,7 @@ 
        and the disabled nodes have a state of idle.

        -
        B.5.2 Error in job submit request.
        +
        B.5.2 Error in job submit request.

        It is possible that your job is pending because it requested resources that are not available within Speed. To verify why job id 1234 is not running, execute ‘sacct -j 1234’. A summary of the reasons is available via the squeue command.

        C Sister Facilities
        +

        C Sister Facilities

        Below is a list of resources and facilities similar to Speed, at various capacities. Depending on your research group and needs, they might be available to you. They are not managed by HPC/NAG of AITS, so contact their respective representatives.

        C -

        References

        +

        +

        References