Merge pull request NAG-DevOps#31 from NAG-DevOps/slurm
translate the manual and examples from GE to SLURM
smokhov authored Nov 13, 2023
2 parents b804619 + d826d14 commit 0711e03
Showing 40 changed files with 1,560 additions and 672 deletions.
27 changes: 18 additions & 9 deletions README.md
Expand Up @@ -11,43 +11,52 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs
* [Overview Slides](https://docs.google.com/presentation/d/1zu4OQBU7mbj0e34Wr3ILXLPWomkhBgqGZ8j8xYrLf44/edit?usp=sharing)
* [AITS Service Desk](https://www.concordia.ca/ginacody/aits.html)

## Examples ##
## Examples

* [`src/`](src/) -- sample job scripts
* [`doc/`](doc/) -- user manual sources

## Contributing and TODO ##
## Contributing and TODO

* [Public issue tracker](https://github.com/NAG-DevOps/speed-hpc/issues)
* [Contributions (pull requests)](https://github.com/NAG-DevOps/speed-hpc/pulls) are welcome for your sample job scripts or links/references (subject to reviews)
* For internal access and support requests, please see the GCS Speed Facility link above

### Contributors ###
### Contributors

* See the overall contributors [here](https://github.com/NAG-DevOps/speed-hpc/graphs/contributors)
* [Serguei A. Mokhov](https://github.com/smokhov) -- project lead
* HPC/Research support team: [Gillian A. Roper](https://github.com/yulgroper), [Carlos Alarcon Meza](https://github.com/carlos-encs), [Tariq Daradkeh](https://github.com/tariqghd)
* [Anh H Nguyen](https://github.com/aaanh) contributed the [HTML](https://nag-devops.github.io/speed-hpc/) version of the manual and its generation off our LaTeX sources as well as the corresponding [devcontainer](https://github.com/NAG-DevOps/speed-hpc/tree/master/doc/.devcontainer) environment
* The initial Grid Engine V6 manual was written by Dr. Scott Bunnell

## References ##
## References

### Conferences ###
### Conferences

* Serguei Mokhov, Jonathan Llewellyn, Carlos Alarcon Meza, Tariq Daradkeh, and Gillian Roper. 2023. **The use of Containers in OpenGL, ML and HPC for Teaching and Research Support.** In ACM SIGGRAPH 2023 Posters (SIGGRAPH '23). Association for Computing Machinery, New York, NY, USA, Article 49, 1–2. [DOI: 10.1145/3588028.3603676](https://doi.org/10.1145/3588028.3603676)

### Related Repositories ###
### Related Repositories

* [OpenISS Dockerfiles](https://github.com/NAG-DevOps/openiss-dockerfiles) -- the source of the Docker containers for the above poster as well as Singularity images based off it for Speed
* Sample repositories of complete projects, more complex than the baby jobs, based on the work of students and their theses:
* https://github.com/NAG-DevOps/openiss-yolov3
* https://github.com/NAG-DevOps/openiss-reid-tfk
* https://github.com/NAG-DevOps/kg-recommendation-framework

### Technical ###
### Technical

* [Slurm Workload Manager](https://en.wikipedia.org/wiki/Slurm_Workload_Manager)
* [Linux and other tutorials from Software Carpentry](https://software-carpentry.org/lessons/)
* [Digital Research Alliance of Canada SLURM Examples](https://docs.alliancecan.ca/wiki/Running_jobs)
* Concordia's subscription to [Udemy resources](https://www.concordia.ca/it/services/udemy.html)
* [NVIDIA Tesla P6](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/Tesla-P6-Product-Brief.pdf)
* [AMD Tonga FirePro S7100X](https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#FirePro_Server_Series_(S000x/Sxx_000))

### Legacy

Speed no longer runs Grid Engine; these are provided for reference only.

* [Altair Grid Engine (AGE)](https://www.altair.com/grid-engine/) (formerly [Univa Grid Engine (UGE)](https://en.wikipedia.org/wiki/Univa_Grid_Engine))
* [UGE User Guide for version 8.6.3 (current version running on speed)](https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/UsersGuideGE.pdf)
* [Altair product documentation](https://community.altair.com/community?id=altair_product_documentation)
* [NVIDIA Tesla P6](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/Tesla-P6-Product-Brief.pdf)
* [AMD Tonga FirePro S7100X](https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#FirePro_Server_Series_(S000x/Sxx_000))
5 changes: 3 additions & 2 deletions SECURITY.md
Expand Up @@ -4,8 +4,9 @@

| Version | Supported |
| ------- | ------------------ |
| 6.6.x | :white_check_mark: |
| 6.5.x | :x: |
| 7.x | :white_check_mark: |
| 6.6.x | :x: |
| 6.5.x | :x: |
| < 6.5 | :x: |

## Reporting a Vulnerability
Binary file removed doc/GE/AdminsGuideGE.pdf
Binary file not shown.
Binary file removed doc/GE/IntroductionGE.pdf
Binary file not shown.
Binary file removed doc/GE/ManpageReferenceGE.pdf
Binary file not shown.
Binary file removed doc/GE/TroubleShootingQuickReferenceGE.pdf
Binary file not shown.
Binary file removed doc/GE/UsersGuideGE.pdf
Binary file not shown.
Binary file added doc/images/pycharm.png
Binary file added doc/images/rosetta-mapping.png
Binary file added doc/images/slurm-arch.png
Binary file added doc/images/speed-pics.png
120 changes: 94 additions & 26 deletions doc/scheduler-directives.tex
@@ -1,51 +1,119 @@
% ------------------------------------------------------------------------------
% ------------------------------------------------------------------------------
\subsubsection{Directives}
\label{sect:directives}

Directives are comments included at the beginning of a job script that set the shell
and the options for the job scheduler.

%
The shebang directive is always the first line of a script. In your job script,
this directive sets which shell your script's commands will run in. On ``Speed'',
we recommend that your script use a shell from the \texttt{/encs/bin} directory.

To use the \texttt{tcsh} shell, start your script with: \verb|#!/encs/bin/tcsh|
To use the \texttt{tcsh} shell, start your script with \verb|#!/encs/bin/tcsh|.
%
For \texttt{bash}, start with \verb|#!/encs/bin/bash|.
%
Directives that start with \verb|#SBATCH| set the options for the cluster's
SLURM scheduler. The script template, \texttt{template.sh},
provides the essentials:

For \texttt{bash}, start with: \verb|#!/encs/bin/bash|
%\begin{verbatim}
%#$ -N <jobname>
%#$ -cwd
%#$ -m bea
%#$ -pe smp <corecount>
%#$ -l h_vmem=<memory>G
%\end{verbatim}
\begin{verbatim}
#SBATCH --job-name=tmpdir ## Give the job a name
#SBATCH --mail-type=ALL ## Receive all email type notifications
#SBATCH [email protected]
#SBATCH --chdir=./ ## Use current directory as working directory
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<corecount> ## Request, e.g. 8 cores
#SBATCH --mem=<memory> ## Assign, e.g., 32G memory per node
\end{verbatim}

Directives that start with \verb|"#$"|, set the options for the cluster's
``Altair Grid Engine (AGE)'' scheduler. The script template, \file{template.sh},
provides the essentials:
and its short option equivalents:

\begin{verbatim}
#$ -N <jobname>
#$ -cwd
#$ -m bea
#$ -pe smp <corecount>
#$ -l h_vmem=<memory>G
#SBATCH -J tmpdir ## Give the job a name
#SBATCH --mail-type=ALL ## Receive all email type notifications
#SBATCH [email protected]
#SBATCH --chdir=./ ## Use current directory as working directory
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -c 8 ## Request 8 cores
#SBATCH --mem=32G ## Assign 32G memory per node
\end{verbatim}

Replace, \verb+<jobname>+, with the name that you want your cluster job to have;
\option{-cwd}, makes the current working directory the ``job working directory'',
and your standard output file will appear here; \option{-m bea}, provides e-mail
notifications (begin/end/abort); replace, \verb+<corecount>+, with the degree of
(multithreaded) parallelism (i.e., cores) you attach to your job (up to 32),
be sure to delete or comment out the \verb| #$ -pe smp | parameter if it
is not relevant; replace, \verb+<memory>+, with the value (in GB), that you want
your job's memory space to be (up to 500), and all jobs MUST have a memory-space
assignment.
\option{--chdir} makes the current working directory the ``job working directory'',
and your standard output file will appear there; \option{--mail-type} provides e-mail
notifications (success, error, etc., or all); replace \verb+<corecount>+ with the degree of
(multithreaded) parallelism (i.e., cores) you attach to your job (up to 32 by default).
%be sure to delete or comment out the \verb| #$ -pe smp | parameter if it
%is not relevant;

Replace \verb+<memory>+ with the value (in GB) that you want
your job's memory space to be (up to 500, depending on the node). All jobs MUST have a
memory-space assignment.
%
If you are unsure about memory footprints, err on the side of assigning a generous
memory space to your job so that it does not get prematurely terminated
(the value given to \api{h\_vmem} is a hard memory ceiling). You can refine
\api{h\_vmem} values for future jobs by monitoring the size of a job's active
memory space to your job, so that it does not get prematurely terminated.
%(the value given to \api{h\_vmem} is a hard memory ceiling).
You can refine
%\api{h\_vmem}
\option{--mem}
values for future jobs by monitoring the size of a job's active
memory space on \texttt{speed-submit} with:

%\begin{verbatim}
%qstat -j <jobID> | grep maxvmem
%\end{verbatim}

\begin{verbatim}
qstat -j <jobID> | grep maxvmem
sacct -j <jobID>
sstat -j <jobID>
\end{verbatim}

Memory-footprint values are also provided for completed jobs in the final
e-mail notification (as, ``Max vmem'').
\noindent
This can be customized to show specific columns:

\begin{verbatim}
sacct -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <jobID>
sstat -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <jobID>
\end{verbatim}
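The reported memory footprint can be turned into a starting point for the next job's \option{--mem} request. The following is a minimal sketch, not part of the Speed tooling: \texttt{suggest\_mem} is a hypothetical helper name, the 20\% headroom is an arbitrary illustrative choice, and the input is assumed to be a size with a \texttt{K}, \texttt{M}, \texttt{G}, or \texttt{T} suffix (or plain bytes), which may differ from what \tool{sacct} prints on your system.

```shell
# Hypothetical helper (not provided on Speed): convert a memory size such as
# "3145728K" into a whole-GB value with 20% headroom, rounded up, suitable
# as a first guess for --mem. The suffix handling is an assumption about
# the format of the value you copy from the sacct/sstat output.
suggest_mem() {
    echo "$1" | awk '
        {
            unit  = substr($0, length($0), 1)   # last character: K, M, G, T or a digit
            value = $0 + 0                      # numeric prefix of the field
            if (unit == "K")      gb = value / 1024 / 1024
            else if (unit == "M") gb = value / 1024
            else if (unit == "G") gb = value
            else if (unit == "T") gb = value * 1024
            else                  gb = value / 1024 / 1024 / 1024   # assume plain bytes
            gb = gb * 1.2                       # illustrative 20% headroom
            rounded = int(gb)
            if (rounded < gb) rounded++         # round up to a whole GB
            if (rounded < 1)  rounded = 1       # never suggest less than 1G
            printf "%dG\n", rounded
        }'
}

suggest_mem 3145728K
suggest_mem 512M
```

The printed value would then be used as \verb|#SBATCH --mem=<value>| in the next submission; treat it as a first guess and keep monitoring, since a job's footprint can vary between runs.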

Memory-footprint values are also provided for completed jobs in the final
e-mail notification (as ``maxvmsize'').
%
\emph{Jobs that request a low-memory footprint are more likely to load on a busy
cluster.}

Other essential options are \option{-t} and \option{-A}.
%
\begin{itemize}
\item
\option{-t} -- the estimate of how long your job may run. It is
used in determining the scheduling priority of your job. The maximum, as
already mentioned, is 7 days for batch and 24 hours for interactive jobs.
Specifying a shorter time may get your job scheduled sooner. There is no
universal ``best'' value; it is often determined empirically from past runs.

\item
\option{-A} -- the project or association to which the job's accounting is
attributed. This is usually your research or supervisor group, a project,
or some other kind of association. When moving from GE to SLURM we ported
most users to two default accounts, \texttt{speed1} and \texttt{speed2}.
These are generic catch-all accounts for when you are unsure what to use.
Normally we tell you in our intro e-mail which one to use, which may
be your default account. Examples include
\texttt{aits},
\texttt{vidpro},
\texttt{gipsy},
\texttt{ai2},
\texttt{mpackir},
\texttt{cmos}, among others.

\end{itemize}
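Putting the directives above together, a complete job script might look like the sketch below. The job name, core count, memory, and time values are placeholders chosen for illustration, and \texttt{speed1} is used only because it is one of the catch-all accounts mentioned above; substitute the account from your intro e-mail. The mail options are left out here and can be added as in \texttt{template.sh}.

```shell
#!/encs/bin/bash
#SBATCH --job-name=sketch      ## Placeholder job name
#SBATCH --chdir=./             ## Use current directory as working directory
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4      ## Placeholder: request 4 cores
#SBATCH --mem=8G               ## Placeholder: 8G memory per node
#SBATCH -t 02:00:00            ## Placeholder: estimate of 2 hours
#SBATCH -A speed1              ## Catch-all account; replace with your own

# SLURM sets these variables inside a job. The fallback values only let the
# script be dry-run outside the scheduler; they are not part of SLURM.
CORES="${SLURM_CPUS_PER_TASK:-1}"
JOB_ID="${SLURM_JOB_ID:-dry-run}"

SUMMARY="job ${JOB_ID}: using ${CORES} core(s) on $(hostname)"
echo "$SUMMARY"

# The actual workload would follow here.
```

Saved as, say, \texttt{sketch.sh} (a hypothetical file name), the script would be submitted with \texttt{sbatch sketch.sh}. Because the \verb|#SBATCH| lines are ordinary shell comments, the same file can also be run directly in a shell to check for syntax errors before submitting.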
148 changes: 78 additions & 70 deletions doc/scheduler-env.tex
@@ -1,34 +1,42 @@
% ------------------------------------------------------------------------------
% ------------------------------------------------------------------------------
\subsubsection{Environment Set Up}
\label{sect:envsetup}

After creating an SSH connection to ``Speed'', you will need to source
the ``Altair Grid Engine (AGE)'' scheduler's settings file.
Sourcing the settings file will set the environment variables required to
execute scheduler commands.

Based on the UNIX shell type, choose one of the following commands to source
the settings file.

csh/\tool{tcsh}:
\begin{verbatim}
source /local/pkg/uge-8.6.3/root/default/common/settings.csh
\end{verbatim}

Bourne shell/\tool{bash}:
\begin{verbatim}
. /local/pkg/uge-8.6.3/root/default/common/settings.sh
\end{verbatim}

In order to set up the default ENCS bash shell, executing the following command
is also required:
\begin{verbatim}
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
\end{verbatim}

To verify that you have access to the scheduler commands execute
\texttt{qstat -f -u "*"}. If an error is returned, attempt sourcing
the settings file again.
After creating an SSH connection to Speed, you will need to
make sure the \tool{srun}, \tool{sbatch}, and \tool{salloc}
commands are available to you.
Type each command name at the command prompt and press Enter.
If a command is not available (e.g., ``command not found'' is returned),
make sure your \api{\$PATH} has \texttt{/local/bin} in it.
To view your \api{\$PATH}, type \texttt{echo \$PATH} at the prompt.
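The check described above can also be scripted. The sketch below assumes a Bourne-compatible shell such as \tool{bash}; the default \tool{tcsh} uses different syntax, so treat this as an illustration rather than something to paste into a \tool{tcsh} session. The variable names are made up for the example.

```shell
# Report which of the scheduler commands are reachable from this shell.
MISSING=""
for cmd in srun sbatch salloc; do
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: found"
    else
        echo "$cmd: not found"
        MISSING="$MISSING $cmd"
    fi
done

# Check whether /local/bin is one of the colon-separated PATH entries.
case ":$PATH:" in
    *:/local/bin:*) HAS_LOCAL_BIN=yes ;;
    *)              HAS_LOCAL_BIN=no ;;
esac
echo "/local/bin in PATH: $HAS_LOCAL_BIN"
```

If any command is reported as not found and \texttt{/local/bin} is missing from the path, that is the situation the paragraph above describes.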
%
%source
%the ``Altair Grid Engine (AGE)'' scheduler's settings file.
%Sourcing the settings file will set the environment variables required to
%execute scheduler commands.
%
%Based on the UNIX shell type, choose one of the following commands to source
%the settings file.
%
%csh/\tool{tcsh}:
%\begin{verbatim}
%source /local/pkg/uge-8.6.3/root/default/common/settings.csh
%\end{verbatim}
%
%Bourne shell/\tool{bash}:
%\begin{verbatim}
%. /local/pkg/uge-8.6.3/root/default/common/settings.sh
%\end{verbatim}
%
%In order to set up the default ENCS bash shell, executing the following command
%is also required:
%\begin{verbatim}
%printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
%\end{verbatim}
%
%To verify that you have access to the scheduler commands execute
%\texttt{qstat -f -u "*"}. If an error is returned, attempt sourcing
%the settings file again.

The next step is to copy a job template to your home directory and to set up your
cluster-specific storage. Execute the following command from within your
Expand All @@ -39,50 +47,50 @@ \subsubsection{Environment Set Up}
cp /home/n/nul-uge/template.sh . && mkdir /speed-scratch/$USER
\end{verbatim}

\textbf{Tip:} Add the source command to your shell-startup script.
%\textbf{Tip:} Add the source command to your shell-startup script.

\textbf{Tip:} the default shell for GCS ENCS users is \tool{tcsh}.
If you would like to use \tool{bash}, please contact
\texttt{rt-ex-hpc AT encs.concordia.ca}.

For \textbf{new ENCS Users}, and/or those who don't have a shell-startup script,
based on your shell type use one of the following commands to copy a start up script
from \texttt{nul-uge}'s. home directory to your home directory. (To move to your home
directory, type \tool{cd} at the Linux prompt and press \texttt{Enter}.)

csh/\tool{tcsh}:
\begin{verbatim}
cp /home/n/nul-uge/.tcshrc .
\end{verbatim}

Bourne shell/\tool{bash}:
\begin{verbatim}
cp /home/n/nul-uge/.bashrc .
\end{verbatim}

Users who already have a shell-startup script, use a text editor, such as
\tool{vim} or \tool{emacs}, to add the source request to your existing
shell-startup environment (i.e., to the \file{.tcshrc} file in your home directory).

csh/\tool{tcsh}:
Sample \file{.tcshrc} file:
\begin{verbatim}
# Speed environment set up
if ($HOSTNAME == speed-submit.encs.concordia.ca) then
source /local/pkg/uge-8.6.3/root/default/common/settings.csh
endif
\end{verbatim}

Bourne shell/\tool{bash}:
Sample \file{.bashrc} file:
\begin{verbatim}
# Speed environment set up
if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
. /local/pkg/uge-8.6.3/root/default/common/settings.sh
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
fi
\end{verbatim}

Note that you will need to either log out and back in, or execute a new shell,
for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to be applied
(\textbf{important}).
%For \textbf{new GCS ENCS Users}, and/or those who don't have a shell-startup script,
%based on your shell type use one of the following commands to copy a start up script
%from \texttt{nul-uge}'s home directory to your home directory. (To move to your home
%directory, type \tool{cd} at the Linux prompt and press \texttt{Enter}.)

%csh/\tool{tcsh}:
%\begin{verbatim}
%cp /home/n/nul-uge/.tcshrc .
%\end{verbatim}

%Bourne shell/\tool{bash}:
%\begin{verbatim}
%cp /home/n/nul-uge/.bashrc .
%\end{verbatim}

%Users who already have a shell-startup script, can use a text editor, such as
%\tool{vim} or \tool{emacs}, to add the source request to your existing
%shell-startup environment (i.e., to the \file{.tcshrc} file in your home directory).

%csh/\tool{tcsh}:
%Sample \file{.tcshrc} file:
%\begin{verbatim}
%# Speed environment set up
%if ($HOSTNAME == speed-submit.encs.concordia.ca) then
%source /local/pkg/uge-8.6.3/root/default/common/settings.csh
%endif
%\end{verbatim}
%
%Bourne shell/\tool{bash}:
%Sample \file{.bashrc} file:
%\begin{verbatim}
%# Speed environment set up
%if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
%. /local/pkg/uge-8.6.3/root/default/common/settings.sh
%printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
%fi
%\end{verbatim}

Note: if you are getting ``command not found'' error(s) when logging in, you
probably still have old Grid Engine environment commands in your shell-startup
files. Remove them as per \xa{appdx:uge-to-slurm}.