diff --git a/doc/images/vscode.png b/doc/images/vscode.png
new file mode 100644
index 0000000..067b9bd
Binary files /dev/null and b/doc/images/vscode.png differ
diff --git a/doc/scheduler-scripting.tex b/doc/scheduler-scripting.tex
index d695d89..99f9f33 100644
--- a/doc/scheduler-scripting.tex
+++ b/doc/scheduler-scripting.tex
@@ -534,7 +534,11 @@ \subsubsection{Graphical Applications}
Once landed on a compute node, verify \api{DISPLAY} again.
\item
-While running under scheduler, unset \api{XDG\_RUNTIME\_DIR}.
+While running under the scheduler, create a runtime directory and set the variable \api{XDG\_RUNTIME\_DIR} to point to it:
+\begin{verbatim}
+mkdir -p /speed-scratch/$USER/run-dir
+setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
+\end{verbatim}
\item
Launch your graphical application:
@@ -545,7 +549,9 @@ \subsubsection{Graphical Applications}
Here's an example of starting PyCharm (see \xf{fig:pycharm}), of which
we made a sample local installation. You can make a similar install
under your own directory. If using VSCode, it's
-currently only supported with the \tool{--no-sandbox} option.
+currently only supported with the \tool{--no-sandbox} option.
+
+BASH version:
\scriptsize
\begin{verbatim}
@@ -553,16 +559,36 @@ \subsubsection{Graphical Applications}
serguei@speed's password:
[serguei@speed-submit ~] % echo $DISPLAY
localhost:14.0
-[serguei@speed-submit ~] % srun -p ps --pty --x11=first --mem 4000 -t 0-06:00 /encs/bin/bash
+[serguei@speed-submit ~] % salloc -p ps --x11=first --mem=4Gb -t 0-06:00
bash-4.4$ echo $DISPLAY
localhost:77.0
bash-4.4$ hostname
speed-01.encs.concordia.ca
-bash-4.4$ unset XDG_RUNTIME_DIR
+bash-4.4$ export XDG_RUNTIME_DIR=/speed-scratch/$USER/run-dir
bash-4.4$ /speed-scratch/nag-public/bin/pycharm.sh
\end{verbatim}
\normalsize
+TCSH version:
+\scriptsize
+\begin{verbatim}
+ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
+[speed-submit] [/home/c/carlos] > echo $DISPLAY
+localhost:14.0
+[speed-submit] [/home/c/carlos] > cd /speed-scratch/$USER
+[speed-submit] [/speed-scratch/carlos] > echo $DISPLAY
+localhost:13.0
+[speed-submit] [/speed-scratch/carlos] > salloc -p ps --x11=first --mem=4Gb -t 0-06:00
+[speed-07] [/speed-scratch/carlos] > echo $DISPLAY
+localhost:42.0
+[speed-07] [/speed-scratch/carlos] > hostname
+speed-07.encs.concordia.ca
+[speed-07] [/speed-scratch/carlos] > setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
+[speed-07] [/speed-scratch/carlos] > /speed-scratch/nag-public/bin/pycharm.sh
+\end{verbatim}
+\normalsize
+
+
\begin{figure}[htpb]
\includegraphics[width=\columnwidth]{images/pycharm}
\caption{PyCharm Starting up on a Speed Node}
@@ -645,6 +671,65 @@ \subsubsection{Jupyter Notebooks}
\label{fig:jupyter}
\end{figure}
+% ------------------------------------------------------------------------------
+\subsubsection{VScode}
+\label{sect:vscode}
+
+This is an example of running VScode; it is similar to Jupyter notebooks, but it does not use containers.
+This is the Web version; a local (workstation) to remote (speed-node) version also exists, but it is
+intended for advanced users only (no support; run it at your own risk).
+
+\begin{itemize}
+\item
+Environment preparation (for the first time only):
+\begin{enumerate}
+\item
+Go to your speed-scratch directory: \texttt{cd /speed-scratch/\$USER}
+\item
+Create a vscode directory: \texttt{mkdir vscode}
+\item
+Go to vscode: \texttt{cd vscode}
+\item
+Create home and projects: \texttt{mkdir \{home,projects\}}
+\item
+Create a run-user directory: \texttt{mkdir -p /speed-scratch/\$USER/run-user}
+\end{enumerate}
+\item
+Running VScode:
+\begin{enumerate}
+\item
+Go to your vscode directory: \texttt{cd /speed-scratch/\$USER/vscode}
+\item
+Open an interactive session: \texttt{salloc --mem=10Gb --constraint=el9}
+\item
+Set the environment variable: \texttt{setenv XDG\_RUNTIME\_DIR /speed-scratch/\$USER/run-user}
+\item
+Run VScode; change the port if needed.
+\scriptsize
+\begin{verbatim}
+/speed-scratch/nag-public/code-server-4.22.1/bin/code-server --user-data-dir=$PWD/projects \
+--config=$PWD/home/.config/code-server/config.yaml --bind-addr="0.0.0.0:8080" $PWD/projects
+\end{verbatim}
+\normalsize
+\item
+Create an ssh tunnel, similar to Jupyter: see \xs{sect:jupyter}
+\item
+Open a browser and type: \texttt{localhost:8080}
+\item
+If the browser asks for a password, retrieve it with:
+\begin{verbatim}
+cat /speed-scratch/$USER/vscode/home/.config/code-server/config.yaml
+\end{verbatim}
+
+\end{enumerate}
+\end{itemize}
+
+\begin{figure}[htbp]
+  \centering
+  \fbox{\includegraphics[width=1.00\textwidth]{images/vscode.png}}
+  \caption{VScode running on a Speed node}
+  \label{fig:vscode}
+\end{figure}
+
% ------------------------------------------------------------------------------
\subsection{Scheduler Environment Variables}
\label{sect:env-vars}
diff --git a/doc/speed-manual.pdf b/doc/speed-manual.pdf
index 6660467..968017e 100644
Binary files a/doc/speed-manual.pdf and b/doc/speed-manual.pdf differ
diff --git a/doc/web/images/vscode.png b/doc/web/images/vscode.png
new file mode 100644
index 0000000..067b9bd
Binary files /dev/null and b/doc/web/images/vscode.png differ
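An editorial aside on the VScode tunnel step above: the manual defers to the Jupyter section, where the tunnel uses port 8888. Adapted to code-server's port 8080, the command would look roughly like the following (a sketch only: speed-XX stands for the node reported by hostname, and YOUR_USER for your ENCS username; run it from your workstation, as with the Jupyter tunnel):

```shell
# Sketch: forward local port 8080 to the compute node running code-server,
# using speed-submit as the jump host. speed-XX and YOUR_USER are placeholders.
ssh -L 8080:speed-XX:8080 YOUR_USER@speed-submit.encs.concordia.ca
```

Keep the tunnel open while your browser is connected to localhost:8080, just as with Jupyter.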
diff --git a/doc/web/index.html b/doc/web/index.html index ee80e67..99fe53e 100644 --- a/doc/web/index.html +++ b/doc/web/index.html @@ -68,52 +68,53 @@

Contents


  2.8.1 Command Line
  2.8.2 Graphical Applications
  2.8.3 Jupyter Notebooks -
 2.9 Scheduler Environment Variables -
 2.10 SSH Keys For MPI -
 2.11 Creating Virtual Environments -
  2.11.1 Anaconda -
  2.11.2 Python -
 2.12 Example Job Script: Fluent -
 2.13 Example Job: efficientdet -
 2.14 Java Jobs -
 2.15 Scheduling On The GPU Nodes -
  2.15.1 P6 on Multi-GPU, Multi-Node -
  2.15.2 CUDA -
  2.15.3 Special Notes for sending CUDA jobs to the GPU Queue -
  2.15.4 OpenISS Examples -
 2.16 Singularity Containers -
3 Conclusion -
 3.1 Important Limitations -
 3.2 Tips/Tricks -
 3.3 Use Cases -
A History -
 A.1 Acknowledgments -
 A.2 Migration from UGE to SLURM -
 A.3 Phases -
  A.3.1 Phase 4 -
  A.3.2 Phase 3 -
  A.3.3 Phase 2 -
  A.3.4 Phase 1 -
B Frequently Asked Questions -
 B.1 Where do I learn about Linux? -
 B.2 How to use the “bash shell” on Speed? -
  B.2.1 How do I set bash as my login shell? -
  B.2.2 How do I move into a bash shell on Speed? -
  B.2.3 How do I use the bash shell in an interactive session on Speed? -
  B.2.4 How do I run scripts written in bash on Speed? -
 B.3 How to resolve “Disk quota exceeded” errors? -
  B.3.1 Probable Cause -
  B.3.2 Possible Solutions -
  B.3.3 Example of setting working directories for COMSOL -
  B.3.4 Example of setting working directories for Python Modules -
 B.4 How do I check my job’s status? -
 B.5 Why is my job pending when nodes are empty? -
  B.5.1 Disabled nodes -
  B.5.2 Error in job submit request. - - - -
C Sister Facilities +
  2.8.4 VScode +
 2.9 Scheduler Environment Variables +
 2.10 SSH Keys For MPI +
 2.11 Creating Virtual Environments +
  2.11.1 Anaconda +
  2.11.2 Python +
 2.12 Example Job Script: Fluent +
 2.13 Example Job: efficientdet +
 2.14 Java Jobs +
 2.15 Scheduling On The GPU Nodes +
  2.15.1 P6 on Multi-GPU, Multi-Node +
  2.15.2 CUDA +
  2.15.3 Special Notes for sending CUDA jobs to the GPU Queue +
  2.15.4 OpenISS Examples +
 2.16 Singularity Containers +
3 Conclusion +
 3.1 Important Limitations +
 3.2 Tips/Tricks +
 3.3 Use Cases +
A History +
 A.1 Acknowledgments +
 A.2 Migration from UGE to SLURM +
 A.3 Phases +
  A.3.1 Phase 4 +
  A.3.2 Phase 3 +
  A.3.3 Phase 2 +
  A.3.4 Phase 1 +
B Frequently Asked Questions +
 B.1 Where do I learn about Linux? +
 B.2 How to use the “bash shell” on Speed? +
  B.2.1 How do I set bash as my login shell? +
  B.2.2 How do I move into a bash shell on Speed? +
  B.2.3 How do I use the bash shell in an interactive session on Speed? +
  B.2.4 How do I run scripts written in bash on Speed? +
 B.3 How to resolve “Disk quota exceeded” errors? +
  B.3.1 Probable Cause +
  B.3.2 Possible Solutions +
  B.3.3 Example of setting working directories for COMSOL +
  B.3.4 Example of setting working directories for Python Modules +
 B.4 How do I check my job’s status? +
 B.5 Why is my job pending when nodes are empty? +
  B.5.1 Disabled nodes + + + +
  B.5.2 Error in job submit request. +
C Sister Facilities
Annotated Bibliography @@ -994,32 +995,65 @@
Once landed on a compute node, verify DISPLAY again. -
  • While running under scheduler, unset XDG_RUNTIME_DIR. -
  • +
  • +

While running under the scheduler, create a runtime directory and set the variable + XDG_RUNTIME_DIR to point to it. + + + +

    +
    +     mkdir -p /speed-scratch/$USER/run-dir
    +     setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
    +
    +

    +

  • -

    Launch your graphical application: -

    module load the required version, then matlab, or abaqus cme, etc.

  • -

    Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local +

    Launch your graphical application: +

    module load the required version, then matlab, or abaqus cme, etc.

    +

    Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local installation. You can make a similar install under your own directory. If using VSCode, it’s currently -only supported with the --no-sandbox option. +only supported with the --no-sandbox option.
    +

    BASH version:

    -
    +   
     bash-3.2$ ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
     serguei@speed’s password:
     [serguei@speed-submit ~] % echo $DISPLAY
     localhost:14.0
    -[serguei@speed-submit ~] % srun -p ps --pty --x11=first --mem 4000 -t 0-06:00 /encs/bin/bash
    +[serguei@speed-submit ~] % salloc -p ps --x11=first --mem=4Gb -t 0-06:00
     bash-4.4$ echo $DISPLAY
     localhost:77.0
     bash-4.4$ hostname
     speed-01.encs.concordia.ca
    -bash-4.4$ unset XDG_RUNTIME_DIR
    +bash-4.4$ export XDG_RUNTIME_DIR=/speed-scratch/$USER/run-dir
     bash-4.4$ /speed-scratch/nag-public/bin/pycharm.sh
     
    -

    +

    +

    TCSH version: + + + +

    +
    +ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
    +[speed-submit] [/home/c/carlos] > echo $DISPLAY
    +localhost:14.0
    +[speed-submit] [/home/c/carlos] > cd /speed-scratch/$USER
    +[speed-submit] [/speed-scratch/carlos] > echo $DISPLAY
    +localhost:13.0
+[speed-submit] [/speed-scratch/carlos] > salloc -p ps --x11=first --mem=4Gb -t 0-06:00
    +[speed-07] [/speed-scratch/carlos] > echo $DISPLAY
    +localhost:42.0
    +[speed-07] [/speed-scratch/carlos] > hostname
    +speed-07.encs.concordia.ca
    +[speed-07] [/speed-scratch/carlos] > setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
    +[speed-07] [/speed-scratch/carlos] > /speed-scratch/nag-public/bin/pycharm.sh
    +
    +

    @@ -1030,7 +1064,7 @@
    PIC +

    PIC

    Figure 4: PyCharm Starting up on a Speed Node
    @@ -1038,54 +1072,54 @@
    2.8.3 Jupyter Notebooks
    -

    This is an example of running Jupyter notebooks together with Singularity (more on Singularity see +

This is an example of running Jupyter notebooks together with Singularity (for more on Singularity, see Section 2.16). Here we are using one of the OpenISS-derived containers (see Section 2.15.4 as well). -

    +

    1. Use the --x11 with salloc or srun as described in the above example
    2. Load Singularity module module load singularity/3.10.4/default
    3. -

      Execute this Singularity command on a single line. It’s best to save it in a shell script that you +

      Execute this Singularity command on a single line. It’s best to save it in a shell script that you could call, since it’s long.

      -
      +     
            srun singularity exec -B $PWD\:/speed-pwd,/speed-scratch/$USER\:/my-speed-scratch,/nettemp \
             --env SHELL=/bin/bash --nv /speed-scratch/nag-public/openiss-cuda-conda-jupyter.sif \
             /bin/bash -c ’/opt/conda/bin/jupyter notebook --no-browser --notebook-dir=/speed-pwd \
             --ip="*" --port=8888 --allow-root’
       
      -

      +

    4. -

      Create an ssh tunnel between your computer and the node (speed-XX) where Jupyter is +

      Create an ssh tunnel between your computer and the node (speed-XX) where Jupyter is running (Using speed-submit as a “jump server”) (Preferably: PuTTY, see Figure 5 and Figure 6)

      -
      +     
            ssh -L 8888:speed-XX:8888 YOUR_USER@speed-submit.encs.concordia.ca
       
      -

      Don’t close the tunnel. +

      Don’t close the tunnel.

    5. -

      Open a browser, and copy your Jupyter’s token, in the screenshot example in Figure 7; each +

      Open a browser, and copy your Jupyter’s token, in the screenshot example in Figure 7; each time the token will be different, as it printed to you in the terminal.

      -
      +     
            http://localhost:8888/?token=5a52e6c0c7dfc111008a803e5303371ed0462d3d547ac3fb
       
      -

      +

    6. Work with your notebook.
    @@ -1134,8 +1168,84 @@
    2.8
    -

    2.9 Scheduler Environment Variables

    -

    The scheduler presents a number of environment variables that can be used in your jobs. You can +

    2.8.4 VScode
    +

This is an example of running VScode; it is similar to Jupyter notebooks, but it does not use containers. +This is the Web version; a local (workstation) to remote (speed-node) version also exists, but it is +intended for advanced users only (no support; run it at your own risk). +

    +
      +
    • +

Environment preparation (for the first time only): +

+
1. Go to your speed-scratch directory: cd /speed-scratch/$USER +
2. Create a vscode directory: mkdir vscode +
3. Go to vscode: cd vscode +
4. Create home and projects: mkdir {home,projects} +
5. Create a run-user directory: mkdir -p /speed-scratch/$USER/run-user
      +
    • +
    • +

      Running VScode +

        +
1. Go to your vscode directory: cd /speed-scratch/$USER/vscode +
2. Open an interactive session: salloc --mem=10Gb --constraint=el9 + + + +
3. Set the environment variable: setenv XDG_RUNTIME_DIR + /speed-scratch/$USER/run-user +
4. +

Run VScode; change the port if needed. + + + +

+
+         /speed-scratch/nag-public/code-server-4.22.1/bin/code-server --user-data-dir=$PWD/projects \
+         --config=$PWD/home/.config/code-server/config.yaml --bind-addr="0.0.0.0:8080" $PWD/projects
+
+

+

5. Create an SSH tunnel, similar to Jupyter: see Section 2.8.3 +
6. Open a browser and type: localhost:8080 +
7. +

If the browser asks for a password, retrieve it with: + + + +

        +
        +         cat /speed-scratch/$USER/vscode/home/.config/code-server/config.yaml
        +
        +

        +

        +
      +
    +
    + + + + + + + + +
    PIC
    +
    Figure 8: VScode running on a Speed node
    + + + +
    +

    2.9 Scheduler Environment Variables

    +

The scheduler presents a number of environment variables that can be used in your jobs. You can invoke env or printenv in your job to know what those are (most begin with the prefix SLURM). Some of the more useful ones are:

    @@ -1155,48 +1265,48 @@

    $SLURM_ARRAY_TASK_ID=for array jobs (see Section 2.6).
  • -

    See a more complete list here: +

    See a more complete list here:

  • -

    In Figure 8 is a sample script, using some of these. +

    In Figure 9 is a sample script, using some of these.

    - + -
    #!/encs/bin/tcsh 
    - 
    -#SBATCH --job-name=tmpdir      ## Give the job a name 
    -#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    -#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
    -#SBATCH --chdir=./             ## Use currect directory as working directory 
    -#SBATCH --nodes=1 
    -#SBATCH --ntasks=1 
    -#SBATCH --cpus-per-task=8      ## Request 8 cores 
    -#SBATCH --mem=32G              ## Assign 32G memory per node 
    - 
    -cd $TMPDIR 
    -mkdir input 
    -rsync -av $SLURM_SUBMIT_DIR/references/ input/ 
    -mkdir results 
    -srun STAR --inFiles $TMPDIR/input --parallel $SRUN_CPUS_PER_TASK --outFiles $TMPDIR/results 
    -rsync -av $TMPDIR/results/ $SLURM_SUBMIT_DIR/processed/
    +
    #!/encs/bin/tcsh 
    + 
    +#SBATCH --job-name=tmpdir      ## Give the job a name 
    +#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    +#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
+#SBATCH --chdir=./             ## Use current directory as working directory 
    +#SBATCH --nodes=1 
    +#SBATCH --ntasks=1 
    +#SBATCH --cpus-per-task=8      ## Request 8 cores 
    +#SBATCH --mem=32G              ## Assign 32G memory per node 
    + 
    +cd $TMPDIR 
    +mkdir input 
    +rsync -av $SLURM_SUBMIT_DIR/references/ input/ 
    +mkdir results 
    +srun STAR --inFiles $TMPDIR/input --parallel $SRUN_CPUS_PER_TASK --outFiles $TMPDIR/results 
    +rsync -av $TMPDIR/results/ $SLURM_SUBMIT_DIR/processed/
     
    -
    Figure 8: Source code for tmpdir.sh
    +
    Figure 9: Source code for tmpdir.sh
    -

    2.10 SSH Keys For MPI

    +

    2.10 SSH Keys For MPI

    Some programs effect their parallel processing via MPI (which is a communication protocol). An example of such software is Fluent. MPI needs to have ‘passwordless login’ set up, which means SSH keys. In your NFS-mounted home directory: @@ -1214,7 +1324,7 @@

    2.10 permissions by default).
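The key-setup commands themselves are cut off by the excerpt above; a typical passwordless-SSH setup in an NFS-shared home directory looks roughly like the following (a sketch assuming OpenSSH defaults, not necessarily the manual's exact recipe):

```shell
# Generate a passphrase-less key pair and authorize it for logins to any
# node that mounts the same home directory.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519
cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

Because the home directory is shared across nodes, authorizing your own key once is enough for MPI's node-to-node logins.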

    -

    2.11 Creating Virtual Environments

    +

    2.11 Creating Virtual Environments

The following documentation is specific to the Speed HPC Facility at the Gina Cody School of Engineering and Computer Science. Virtual environments are typically instantiated via Conda or Python; another option is Singularity, detailed in Section 2.16. Usually, virtual environments are created once @@ -1223,7 +1333,7 @@

    -
    2.11.1 Anaconda
    +
    2.11.1 Anaconda

    Request an interactive session in the queue you wish to submit your jobs to (e.g., salloc -p pg –gpus=1 for GPU jobs). Once your interactive has started, create an anaconda environment in your speed-scratch directory by using the prefix option when executing conda create. For example, @@ -1236,7 +1346,7 @@

    2.11.1 -
    +   
     module load anaconda3/2023.03/default
     conda create --prefix /speed-scratch/a_user/myconda
     
    @@ -1244,13 +1354,13 @@
    2.11.1

    Note: Without the prefix option, the conda create command creates the environment in a_user’s home directory by default.

    -

    List Environments. +

    List Environments. To view your conda environments, type: conda info --envs

    -
    +   
     # conda environments:
     #
     base                  *  /encs/pkg/anaconda3-2023.03/root
    @@ -1258,13 +1368,13 @@ 
    2.11.1

    -

    Activate an Environment. +

Activate an Environment. Activate the environment /speed-scratch/a_user/myconda as follows

    -
    +   
     conda activate /speed-scratch/a_user/myconda
     

    After activating your environment, add pip to your environment by using @@ -1272,7 +1382,7 @@

    2.11.1 -
    +   
     conda install pip
     

    This will install pip and pip’s dependencies, including python, into the environment. @@ -1284,7 +1394,7 @@

    2.11.1 -
    +     
          salloc -p pg --gpus=1 --mem=10GB -A <slurm account name>
          cd /speed-scratch/$USER
          module load python/3.11.0/default
    @@ -1304,7 +1414,7 @@ 
    2.11.1 conda install installs modules from anaconda’s repository.

    -
    2.11.2 Python
    +
    2.11.2 Python

    Setting up a Python virtual environment is fairly straightforward. The first step is to request an interactive session in the queue you wish to submit your jobs to.

    We have a simple example that use a Python virtual environment: @@ -1316,7 +1426,7 @@

    2.11.2 -
    +     
          salloc -p pg --gpus=1 --mem=10GB -A <slurm account name>
          cd /speed-scratch/$USER
          module load python/3.9.1/default
    @@ -1336,56 +1446,56 @@ 
    2.11.2 --gpus= when preparing environments for CPU jobs.

    -

    2.12 Example Job Script: Fluent

    +

    2.12 Example Job Script: Fluent

    - - - - -
    #!/encs/bin/tcsh 
    - 
    -#SBATCH --job-name=flu10000    ## Give the job a name 
    -#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    -#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
    -#SBATCH --chdir=./             ## Use currect directory as working directory 
    -#SBATCH --nodes=1              ## Number of nodes to run on 
    -#SBATCH --ntasks-per-node=32   ## Number of cores 
    -#SBATCH --cpus-per-task=1      ## Number of MPI threads 
    -#SBATCH --mem=160G             ## Assign 160G memory per node 
    - 
    -date 
    - 
    -module avail ansys 
    - 
    -module load ansys/19.2/default 
    -cd $TMPDIR 
    - 
    -set FLUENTNODES = "‘scontrol␣show␣hostnames‘" 
    -set FLUENTNODES = ‘echo $FLUENTNODES | tr ’ ’ ’,’‘ 
    - 
    -date 
    - 
    -srun fluent 3ddp \ 
    -        -g -t$SLURM_NTASKS \ 
    -        -g-cnf=$FLUENTNODES \ 
    -        -i $SLURM_SUBMIT_DIR/fluentdata/info.jou > call.txt 
    - 
    -date 
    - 
    -srun rsync -av $TMPDIR/ $SLURM_SUBMIT_DIR/fluentparallel/ 
    - 
    -date
    +
    +                                                                               
    +
    +                                                                               
    +
    #!/encs/bin/tcsh 
    + 
    +#SBATCH --job-name=flu10000    ## Give the job a name 
    +#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    +#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
+#SBATCH --chdir=./             ## Use current directory as working directory 
    +#SBATCH --nodes=1              ## Number of nodes to run on 
    +#SBATCH --ntasks-per-node=32   ## Number of cores 
    +#SBATCH --cpus-per-task=1      ## Number of MPI threads 
    +#SBATCH --mem=160G             ## Assign 160G memory per node 
    + 
    +date 
    + 
    +module avail ansys 
    + 
    +module load ansys/19.2/default 
    +cd $TMPDIR 
    + 
    +set FLUENTNODES = "‘scontrol␣show␣hostnames‘" 
    +set FLUENTNODES = ‘echo $FLUENTNODES | tr ’ ’ ’,’‘ 
    + 
    +date 
    + 
    +srun fluent 3ddp \ 
    +        -g -t$SLURM_NTASKS \ 
    +        -g-cnf=$FLUENTNODES \ 
    +        -i $SLURM_SUBMIT_DIR/fluentdata/info.jou > call.txt 
    + 
    +date 
    + 
    +srun rsync -av $TMPDIR/ $SLURM_SUBMIT_DIR/fluentparallel/ 
    + 
    +date
     
    -
    Figure 9: Source code for fluent.sh
    +
    Figure 10: Source code for fluent.sh
    -

    The job script in Figure 9 runs Fluent in parallel over 32 cores. Of note, we have requested +

The job script in Figure 10 runs Fluent in parallel over 32 cores. Of note, we have requested e-mail notifications (--mail-type), are defining the parallel environment for fluent with -t$SLURM_NTASKS and -g-cnf=$FLUENTNODES (very important), and are setting $TMPDIR as the in-job location for the “moment” rfile.out file (in-job, because the last line of the @@ -1395,7 +1505,7 @@

    Caveat: take care with journal-file file paths.

    -

    2.13 Example Job: efficientdet

    +

    2.13 Example Job: efficientdet

    The following steps describing how to create an efficientdet environment on Speed, were submitted by a member of Dr. Amer’s research group.

    @@ -1416,7 +1526,7 @@

    +
     pip install tensorflow==2.7.0
     pip install lxml>=4.6.1
     pip install absl-py>=0.10.0
    @@ -1435,7 +1545,7 @@ 

    -

    2.14 Java Jobs

    +

    2.14 Java Jobs

Jobs that call java have a memory overhead, which needs to be taken into account when assigning a value to --mem. Even the most basic java call, java -Xmx1G -version, will need to have --mem=5G, with the 4-GB difference representing the memory overhead. Note that this memory @@ -1444,7 +1554,7 @@

    2.14 314G.

    -

    2.15 Scheduling On The GPU Nodes

    +

    2.15 Scheduling On The GPU Nodes

    The primary cluster has two GPU nodes, each with six Tesla (CUDA-compatible) P6 cards: each card has 2048 cores and 16GB of RAM. Though note that the P6 is mainly a single-precision card, so unless you need the GPU double precision, double-precision calculations will be faster on a CPU @@ -1455,7 +1565,7 @@

    +
     #SBATCH --gpus=[1|2]
     

    @@ -1465,7 +1575,7 @@

    +
     sbatch -p pg ./<myscript>.sh
     

    @@ -1474,7 +1584,7 @@

    +
     ssh <username>@speed[-05|-17|37-43] nvidia-smi
     

    @@ -1483,7 +1593,7 @@

    +
     sinfo -p pg --long --Node
     

    @@ -1499,7 +1609,7 @@

    +
     [serguei@speed-submit src] % sinfo -p pg --long --Node
     Thu Oct 19 22:31:04 2023
     NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
    @@ -1527,7 +1637,7 @@ 

    +
     [serguei@speed-submit src] % squeue -p pg -o "%15N %.6D %7P %.11T %.4c %.8z %.6m %.8d %.6w %.8f %20G %20E"
     NODELIST         NODES PARTITI       STATE MIN_    S:C:T MIN_ME MIN_TMP_  WCKEY FEATURES GROUP DEPENDENCY
     speed-05             1 pg          RUNNING    1    *:*:*     1G        0 (null)   (null) 11929     (null)
    @@ -1540,7 +1650,7 @@ 

    -
    2.15.1 P6 on Multi-GPU, Multi-Node
    +
    2.15.1 P6 on Multi-GPU, Multi-Node

As described above, P6 cards are not compatible with the Distribute and DataParallel functions (PyTorch, TensorFlow) when running on multiple GPUs. One workaround is to run the job multi-node, with a single GPU per node; for example: @@ -1548,7 +1658,7 @@

    +
     #SBATCH --nodes=2
     #SBATCH --gpus-per-node=1
     
    @@ -1558,7 +1668,7 @@

    -
    2.15.2 CUDA
    +
    2.15.2 CUDA

    When calling CUDA within job scripts, it is important to create a link to the desired CUDA libraries and set the runtime link path to the same libraries. For example, to use the cuda-11.5 libraries, specify the following in your Makefile. @@ -1566,7 +1676,7 @@

    2.15.2

    -
    +   
     -L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64
     

    @@ -1574,14 +1684,14 @@

    2.15.2 load gcc/8.4 or module load gcc/9.3

    -
    2.15.3 Special Notes for sending CUDA jobs to the GPU Queue
    +
    2.15.3 Special Notes for sending CUDA jobs to the GPU Queue

    Interactive jobs (Section 2.8) must be submitted to the GPU partition in order to compile and link. We have several versions of CUDA installed in:

    -
    +   
     /encs/pkg/cuda-11.5/root/
     /encs/pkg/cuda-10.2/root/
     /encs/pkg/cuda-9.2/root
    @@ -1591,15 +1701,15 @@ 
/usr/local/cuda with one of the above.

    -
    2.15.4 OpenISS Examples
    +
    2.15.4 OpenISS Examples

These represent more comprehensive, research-like examples of jobs for computer vision and other tasks with much longer runtimes (subject to the number of epochs and other parameters), derived from the actual research work of students and their theses. These jobs require the use of CUDA and GPUs. These examples are available as “native” jobs on Speed and as Singularity containers.

    -

    OpenISS and REID - +

OpenISS and REID + The example openiss-reid-speed.sh illustrates a job for computer-vision-based person re-identification (e.g., motion-capture-based tracking for stage performance), part of the OpenISS project by Haotao Lai [10], using TensorFlow and Keras. The fork of the original repo [12] adjusted to @@ -1614,8 +1724,8 @@

    2.15 -

    OpenISS and YOLOv3 - +

OpenISS and YOLOv3 + The related code using the YOLOv3 framework is in the fork of the original repo [11], adjusted to run on Speed, here:

    @@ -1636,7 +1746,7 @@
    2.15
  • https://github.com/NAG-DevOps/speed-hpc/tree/master/src#openiss-yolov3
  • -

    2.16 Singularity Containers

    +

    2.16 Singularity Containers

    If the /encs software tree does not have a required software instantaneously available, another option is to run Singularity containers. We run EL7 flavor of Linux, and if some projects require Ubuntu or other distributions, there is a possibility to run that software as a container, including the ones @@ -1668,7 +1778,7 @@

    2

    -
    +   
     /speed-scratch/nag-public:
     
     openiss-cuda-conda-jupyter.sif
    @@ -1710,7 +1820,7 @@ 

    2

    -
    +   
     salloc --gpus=1 -n8 --mem=4Gb -t60
     cd /speed-scratch/$USER/
     singularity pull openiss-cuda-devicequery.sif docker://openiss/openiss-cuda-devicequery
    @@ -1723,14 +1833,14 @@ 

    2 example.

    -

    3 Conclusion

    +

    3 Conclusion

The cluster is “first come, first served” until it fills, and then job position in the queue is based upon past usage. The scheduler does attempt to fill gaps, though, so sometimes a single-core job of lower priority will schedule before a multi-core job of higher priority, for example.

    -

    3.1 Important Limitations

    +

    3.1 Important Limitations

• New users are restricted to a total of 32 cores: write to rt-ex-hpc@encs.concordia.ca if you need more temporarily (192 is the maximum, or 6 jobs of 32 cores each). @@ -1761,7 +1871,7 @@

      3.

    -

    3.2 Tips/Tricks

    +

    3.2 Tips/Tricks

    • Files/scripts must have Linux line breaks in them (not Windows ones). Use file command to verify; and dos2unix command to convert. @@ -1784,7 +1894,7 @@

      3.2
    • E-mail, rt-ex-hpc AT encs.concordia.ca, with any concerns/questions.

    -

    3.3 Use Cases

    +

    3.3 Use Cases

    • HPC Committee’s initial batch about 6 students (end of 2019):

      @@ -1863,10 +1973,10 @@

      3.3

    -

    A History

    +

    A History

    -

    A.1 Acknowledgments

    +

    A.1 Acknowledgments

    • The first 6 (to 6.5) versions of this manual and early UGE job script samples, Singularity testing and user support were produced/done by Dr. Scott Bunnell during his time at @@ -1876,14 +1986,14 @@

      A.1
• Dr. Tariq Daradkeh was our IT Instructional Specialist from August 2022 to September 2023, working on the scheduler, scheduling research, end user support, and integration - of examples, such as YOLOv3 in Section 2.15.4, among other tasks. We have a continued + of examples, such as YOLOv3 in Section 2.15.4, among other tasks. We have a continued collaboration on HPC/scheduling research.

    -

    A.2 Migration from UGE to SLURM

    +

    A.2 Migration from UGE to SLURM

For long-term users who started off with Grid Engine, here are some resources to ease the transition and map the job submission process.

    @@ -1895,7 +2005,7 @@

    +
          GE  => SLURM
          s.q    ps
          g.q    pg
    @@ -1905,12 +2015,12 @@ 

    -

    Commands and command options mappings are found in Figure 10 from
    https://slurm.schedmd.com/rosetta.pdf
    https://slurm.schedmd.com/pdfs/summary.pdf
    Other related helpful resources from similar organizations who either used SLURM for awhile or +

    Commands and command options mappings are found in Figure 11 from
    https://slurm.schedmd.com/rosetta.pdf
    https://slurm.schedmd.com/pdfs/summary.pdf
Other related helpful resources from similar organizations that either used SLURM for a while or also transitioned to it:
    https://docs.alliancecan.ca/wiki/Running_jobs
    https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf
    https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm

    - PIC -
    Figure 10: Rosetta Mappings of Scheduler Commands from SchedMD
    + PIC +
    Figure 11: Rosetta Mappings of Scheduler Commands from SchedMD
  • @@ -1922,7 +2032,7 @@

    +
          # Speed environment set up
          if ($HOSTNAME == speed-submit.encs.concordia.ca) then
             source /local/pkg/uge-8.6.3/root/default/common/settings.csh
    @@ -1934,7 +2044,7 @@ 

    Similarly, remove the equivalent block from your .bashrc:
          # Speed environment set up
          if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
              . /local/pkg/uge-8.6.3/root/default/common/settings.sh
          fi
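
    A quick way to confirm the cleanup (a suggested check, not from the manual; the grep pattern assumes the uge-8.6.3 paths shown above):

    ```shell
    # Report any leftover UGE references in shell startup files;
    # prints a confirmation message when none are found.
    grep -n "uge-8.6.3" ~/.tcshrc ~/.bashrc 2>/dev/null || echo "no UGE references found"
    ```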

    A.3 Phases

    A brief summary of Speed's evolution phases follows.

    A.3.1 Phase 4

    Phase 4 added 7 SuperMicro servers with 4x A100 80GB GPUs each, dubbed “SPEED2”. We also moved from Grid Engine to SLURM.

    A.3.2 Phase 3

    Phase 3 added 4 vidpro nodes from Dr. Amer, totalling 6x P6 and 6x V100 GPUs.

    A.3.3 Phase 2

    Phase 2 saw 6x NVIDIA Tesla P6 GPUs and 8x more compute nodes added. The P6s replaced 4x of the FirePro S7150 GPUs.


    A.3.4 Phase 1

    Phase 1 of Speed was of the following configuration:


    B Frequently Asked Questions

    B.1 Where do I learn about Linux?

    All Speed users are expected to have a basic understanding of Linux and its commonly used commands.

    Software Carpentry

    Software Carpentry provides free resources to learn software, including a workshop on the Unix shell. https://software-carpentry.org/lessons/

    Udemy

    There are a number of Udemy courses, including free ones, that will assist you in learning Linux. Active Concordia faculty, staff and students have access to Udemy courses. The course Linux Mastery: Master the Linux Command Line in 11.5 Hours is a good starting point.


    B.2 How to use the “bash shell” on Speed?

    This section describes how to use the “bash shell” on Speed. Review Section 2.1.2 to ensure that your bash environment is set up.

    B.2.1 How do I set bash as my login shell?

    In order to set your default login shell to bash on Speed, your login shell on all GCS servers must be changed to bash. To make this change, create a ticket with the Service Desk (or email help at concordia.ca) to request that bash become your default login shell for your ENCS user account on all GCS servers.

    B.2.2 How do I move into a bash shell on Speed?

    To move to the bash shell, type bash at the command prompt. For example:

     [speed-submit] [/home/a/a_user] > bash
     bash-4.4$ echo $0
     bash
    Note how the prompt changes to bash-4.4$ after entering the bash shell.

    B.2.3 How do I use the bash shell in an interactive session on Speed?

    Below are examples of how to use bash as a shell in your interactive job sessions with both the salloc and srun commands.

    srun --mem=50G -n 5 --pty /encs/bin/bash

  • Note: Make sure the interactive job requests memory, cores, etc.
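
    An salloc-based variant (a sketch only; the flags mirror the srun line above and may need adjusting for your job):

    ```
    salloc --mem=50G -n 5 /encs/bin/bash
    ```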

    B.2.4 How do I run scripts written in bash on Speed?

    To execute bash scripts on Speed:

    1. Ensure that the shebang of your bash job script is #!/encs/bin/bash
    2. Use the sbatch command to submit your job script to the scheduler.

    The Speed GitHub contains a sample bash job script.
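
    A minimal job script along those lines (a sketch, not the actual sample from the Speed GitHub; the job name and memory value are placeholders):

    ```shell
    # Write a minimal bash job script; the #SBATCH directives are read by the scheduler.
    cat > job.sh <<'EOF'
    #!/encs/bin/bash
    #SBATCH --job-name=bash-test
    #SBATCH --mem=1G
    echo "Hello from $0"
    EOF
    # Submit with: sbatch job.sh
    ```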

    B.3 How to resolve “Disk quota exceeded” errors?

    B.3.1 Probable Cause

    The “Disk quota exceeded” error occurs when your application has run out of disk space to write to. On Speed this error can be returned when:

    1. Your NFS-provided home is full and cannot be written to. You can verify this using the quota and bigfiles commands.
    2. The /tmp directory on the Speed node your application is running on is full and cannot be written to.

    B.3.2 Possible Solutions

    1. Use the --chdir job script option to set the job working directory, i.e., the directory in which the job will write its output files.
    2. The use of local disk space is generally recommended for I/O-intensive operations. However, as the size of /tmp on Speed nodes is 1TB, it can be necessary for scripts to store temporary data elsewhere. Review the documentation for each module called within your script to determine how to set working directories for it.
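
    For instance (a sketch; the output path is an assumption following the manual's /speed-scratch convention, and job.sh is a placeholder script name):

    ```
    sbatch --chdir=/speed-scratch/$USER/output job.sh
    ```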

      For example, to create an output directory under your speed-scratch space:
                mkdir -m 750 /speed-scratch/$USER/output
                 
       
      Likewise, to create a recovery directory:
                mkdir -m 750 /speed-scratch/$USER/recovery
       


      In the above example, $USER is an environment variable containing your ENCS username.

      B.3.3 Example of setting working directories for COMSOL
      • Create directories for recovery, temporary, and configuration files. For example, to create these directories:

              mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
         

      • Add the following switches to your COMSOL command:
              -recoverydir /speed-scratch/$USER/comsol/recovery
              -tmpdir /speed-scratch/$USER/comsol/tmp
          -configuration /speed-scratch/$USER/comsol/config
        In the above example, $USER is an environment variable containing your ENCS username.

        B.3.4 Example of setting working directories for Python Modules

        By default, when adding a Python module, the /tmp directory is set as the temporary repository for file downloads. The size of the /tmp directory on speed-submit is too small for PyTorch. To add a Python module:

        • Create a tmp directory in your speed-scratch space:
                mkdir /speed-scratch/$USER/tmp
         

        • Set TMPDIR to point to it (tcsh):
                setenv TMPDIR /speed-scratch/$USER/tmp
         


        In the above example, $USER is an environment variable containing your ENCS username.
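
        For bash users, the equivalent of the setenv line above:

        ```shell
        # bash equivalent of the tcsh `setenv TMPDIR ...` shown above
        export TMPDIR=/speed-scratch/$USER/tmp
        ```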

        B.4 How do I check my job’s status?

        When a job with a job ID of 1234 is running or has terminated, its status can be tracked using ‘sacct -j 1234’; ‘squeue -j 1234’ shows the job while it is still sitting in the queue. While the job is running, ‘sstat -j 1234’ reports its live resource usage; after slurmctld purges the job’s tracking state into the database, long-term statistics remain available via sacct.
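
        In summary, a quick reference (the --format fields are standard sacct options, assumed here rather than taken from the manual):

        ```
        squeue -j 1234        # while the job is pending or running
        sstat  -j 1234        # live resource usage of a running job
        sacct  -j 1234 --format=JobID,State,Elapsed,MaxRSS   # accounting, also after completion
        ```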

        B.5 Why is my job pending when nodes are empty?

        B.5.1 Disabled nodes

        It is possible that one or a number of the Speed nodes are disabled. Nodes are disabled if they require maintenance. To verify if Speed nodes are disabled, see if they are in a draining or drained state:


         [serguei@speed-submit src] % sinfo --long --Node
         Thu Oct 19 21:25:12 2023
         NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
        In the sinfo output, available nodes have a state of idle, while disabled nodes show a state of draining or drained.
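
        To see why particular nodes are drained (a standard SLURM query, not taken from the manual):

        ```
        sinfo -R        # list down/drained nodes with the admin-supplied reason
        ```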

        B.5.2 Error in job submit request.

        It is possible that your job is pending because it requested resources that are not available within Speed. To verify why job ID 1234 is not running, execute ‘sacct -j 1234’. A summary of the reasons is available via the squeue command.
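
        For example, to show the pending reason directly (the %r format specifier is a standard squeue option, assumed here rather than taken from the manual):

        ```
        squeue -j 1234 -o "%.18i %.9P %.8T %r"   # job id, partition, state, reason
        ```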

        C Sister Facilities

        Below is a list of resources and facilities similar to Speed at various capacities. Depending on your research group and needs, they might be available to you. They are not managed by HPC/NAG of AITS, so contact their respective representatives.

        References