Merge pull request NAG-DevOps#52 from NAG-DevOps/manual-update
Manual updates for 7.2
smokhov authored Aug 22, 2024
2 parents 3cadb16 + 95bbee8 commit ebed541
Showing 13 changed files with 5,435 additions and 1,846 deletions.
15 changes: 13 additions & 2 deletions README.md
@@ -15,6 +15,10 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs
* [`src/`](src/) -- sample job scripts
* [`doc/`](doc/) -- user manual sources

## Software List

* [EL7 and EL9 Software List](software-list.md) on Speed

## Contributing and TODO

* [Public issue tracker](https://github.com/NAG-DevOps/speed-hpc/issues)
@@ -34,6 +38,7 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs

### Conferences

* Tariq Daradkeh, Gillian Roper, Carlos Alarcon Meza, and Serguei Mokhov. **HPC jobs classification and resource prediction to minimize job failures.** In International Conference on Computer Systems and Technologies 2024 (CompSysTech ’24), New York, NY, USA, June 2024. ACM. [DOI: 10.1145/3674912.3674914](https://doi.org/10.1145/3674912.3674914)
* Serguei Mokhov, Jonathan Llewellyn, Carlos Alarcon Meza, Tariq Daradkeh, and Gillian Roper. 2023. **The use of Containers in OpenGL, ML and HPC for Teaching and Research Support.** In ACM SIGGRAPH 2023 Posters (SIGGRAPH '23). Association for Computing Machinery, New York, NY, USA, Article 49, 1–2. [DOI: 10.1145/3588028.3603676](https://doi.org/10.1145/3588028.3603676)

### Related Repositories
@@ -44,13 +49,19 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs
* https://github.com/NAG-DevOps/openiss-reid-tfk
* https://github.com/NAG-DevOps/kg-recommendation-framework

### Technical
### Educational

* [Slurm Workload Manager](https://en.wikipedia.org/wiki/Slurm_Workload_Manager)
* [Linux and other tutorials from Software Carpentry](https://software-carpentry.org/lessons/)
* [Digital Research Alliance of Canada SLURM Examples](https://docs.alliancecan.ca/wiki/Running_jobs)
* Concordia's subscription to [Udemy resources](https://www.concordia.ca/it/services/udemy.html)

### Technical

* [Slurm Workload Manager](https://en.wikipedia.org/wiki/Slurm_Workload_Manager)
* [NVIDIA A100](https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf)
* [NVIDIA V100](https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf)
* [NVIDIA Tesla P6](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/Tesla-P6-Product-Brief.pdf)
* [NVIDIA RTX 6000 Ada Generation](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/rtx-6000/proviz-print-rtx6000-datasheet-web-2504660.pdf)
* [AMD Tonga FirePro S7100X](https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#FirePro_Server_Series_(S000x/Sxx_000))

### Legacy
8 changes: 7 additions & 1 deletion doc/Makefile
@@ -24,7 +24,7 @@ all: $(DELIVERABLE).pdf
#all: arxiv acm
#all: arxiv

$(DELIVERABLE).pdf: $(DELIVERABLE).tex $(DELIVERABLE).bib Makefile commands.tex
$(DELIVERABLE).pdf: $(DELIVERABLE).tex $(DELIVERABLE).bib Makefile commands.tex software-list.tex
@echo "Compiling *.tex files..."
pdflatex $(PDFLATEXFLAGS) $(DELIVERABLE)
@echo "Compiling bibliography..."
@@ -53,6 +53,12 @@ $(DELIVERABLE)-arxiv.tex: to-arxiv.pl $(DELIVERABLE).tex
./to-arxiv.pl < $(DELIVERABLE).tex > $(DELIVERABLE)-arxiv.tex
perl -pi -e 's/\{content\}/\{content-arxiv\}/g' $(DELIVERABLE)-arxiv.tex

software-list: software-list.tex ../software-list.md
software-list.tex ../software-list.md: generate-software-list.sh
@echo "Generating software list. Don't forget to run make afterwards to recompile the manual."
./generate-software-list.sh
mv -f software-list.md ..

acm: $(DELIVERABLE)-acm.pdf

$(DELIVERABLE)-acm.pdf: $(DELIVERABLE)-acm.tex content-acm.tex Makefile
79 changes: 79 additions & 0 deletions doc/generate-software-list.sh
@@ -0,0 +1,79 @@
#!/encs/bin/bash

# Generates .tex and .md versions of the software list
# Serguei Mokhov

GENERATED_ON=`date`
OUTFILE="software-list"

# Generate the LaTeX version first
cat > "$OUTFILE.tex" << LATEX_HEADER
% -----------------------------------------------------------------------------
% $0
\section{Software Installed On Speed}
\label{sect:software-details}
This section is generated by a script; last updated on \textit{$GENERATED_ON}.
We have two major software trees: Scientific Linux 7 (EL7), which is
outgoing, and AlmaLinux 9 (EL9). After major synchronization of software
packages is complete, we will stop maintaining the EL7 tree and
will migrate the remaining nodes to EL9.
Use \option{--constraint=el7} to select EL7-only installed nodes for their
software packages. Conversely, use \option{--constraint=el9} for the EL9-only
software. These options would be used as a part of your job parameters
in either \api{\#SBATCH} or on the command line.
\noindent
\textbf{NOTE:} this list does not include packages installed directly on the OS (yet).
% -----------------------------------------------------------------------------
\subsection{EL7}
\label{sect:software-el7}
Not all packages are intended for HPC, but the common tree is available
on Speed as well as teaching labs' desktops.
\scriptsize
\begin{multicols}{3}
\begin{itemize}
LATEX_HEADER

ls -1 /encs/ArchDep/x86_64.EL7/pkg/ \
| egrep -v HIDE \
| sed 's/^/\\item \\verb|/g' \
| sed 's/$/|/g' \
>> "$OUTFILE.tex"

cat >> "$OUTFILE.tex" << LATEX_EL9_HEADER
\end{itemize}
\end{multicols}
\normalsize
% -----------------------------------------------------------------------------
\subsection{EL9}
\label{sect:software-el9}
\scriptsize
\begin{multicols}{3}
\begin{itemize}
LATEX_EL9_HEADER

ls -1 /encs/ArchDep/x86_64.EL9/pkg/ \
| egrep -v HIDE \
| sed 's/^/\\item \\verb|/g' \
| sed 's/$/|/g' \
>> "$OUTFILE.tex"

cat >> "$OUTFILE.tex" << LATEX_FOOTER
\end{itemize}
\end{multicols}
\normalsize
% EOF
LATEX_FOOTER

# Get .md version of the same from LaTeX
pandoc -s "$OUTFILE.tex" -o "$OUTFILE.md"

# EOF
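
The filtering and formatting step used twice in the script above can be tried in isolation. The following is only a sketch: the helper name `to_latex_items` and the sample package names are made up for illustration, and the `printf` input stands in for the `ls -1` listing of the package directory.

```shell
# Hypothetical helper (not part of the committed script): read package
# names from stdin, one per line, drop any name containing HIDE, and wrap
# each remaining name as a LaTeX "\item \verb|name|" line.
to_latex_items() {
    grep -v HIDE \
        | sed -e 's/^/\\item \\verb|/' -e 's/$/|/'
}

# Made-up sample names standing in for the real directory listing.
printf '%s\n' 'gcc-11.2' 'HIDE-old-tool' 'python-3.11' | to_latex_items
```

With the sample input, the hidden entry is skipped and the other two names come out as `\item \verb|gcc-11.2|` and `\item \verb|python-3.11|`.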
Binary file added doc/images/speed-architecture-full.png
151 changes: 66 additions & 85 deletions doc/scheduler-directives.tex
@@ -1,115 +1,96 @@
% ------------------------------------------------------------------------------
\subsubsection{Directives}
\label{sect:directives}
% 2.2.1 Directives
% -------------------
% TMP scheduler-specific section

Directives are comments included at the beginning of a job script that set the shell
and the options for the job scheduler.
and the options for the job scheduler.
%
The shebang directive is always the first line of a script. In your job script,
this directive sets which shell your script's commands will run in. On ``Speed'',
we recommend that your script use a shell from the \texttt{/encs/bin} directory.
we recommend that your script use a shell from the \texttt{/encs/bin} directory.\\

To use the \texttt{tcsh} shell, start your script with \verb|#!/encs/bin/tcsh|.
%
For \texttt{bash}, start with \verb|#!/encs/bin/bash|.
%
Directives that start with \verb|#SBATCH|, set the options for the cluster's
Slurm job scheduler. The script template, \texttt{template.sh},
provides the essentials:
For \texttt{bash}, start with \verb|#!/encs/bin/bash|.\\

Directives that start with \verb|#SBATCH| set the options for the cluster's
SLURM job scheduler. The following provides an example of some essential directives:

%\begin{verbatim}
%#$ -N <jobname>
%#$ -cwd
%#$ -m bea
%#$ -pe smp <corecount>
%#$ -l h_vmem=<memory>G
%\end{verbatim}
\small
\begin{verbatim}
#SBATCH --job-name=<jobname> ## or -J. Give the job a name
#SBATCH --mail-type=<type> ## Set type of email notifications
#SBATCH --chdir=<directory> ## or -D, Set working directory where output files will go
#SBATCH --nodes=1 ## or -N, Node count required for the job
#SBATCH --ntasks=1 ## or -n, Number of tasks to be launched
#SBATCH --cpus-per-task=<corecount> ## or -c, Core count requested, e.g. 8 cores
#SBATCH --mem=<memory> ## Assign memory for this job, e.g., 32G memory per node
#SBATCH --job-name=<jobname> ## or -J. Give the job a name
#SBATCH --mail-type=<type> ## set type of email notifications
#SBATCH --chdir=<directory> ## or -D, set working directory for the job
#SBATCH --nodes=1 ## or -N, node count required for the job
#SBATCH --ntasks=1 ## or -n, number of tasks to be launched
#SBATCH --cpus-per-task=<corecount> ## or -c, core count requested, e.g. 8 cores
#SBATCH --mem=<memory> ## assign memory for this job,
## e.g., 32G memory per node
\end{verbatim}
\normalsize

Replace the following to adjust the job script for your project(s)
\begin{enumerate}
\item \verb+<jobname>+ with a job name for the job
\item \verb+<directory>+ with the fullpath to your job's working directory, e.g., where your code,
source files and where the standard output files will be written to. By default, \verb+--chdir+
sets the current directory as the job's working directory
\item \verb+<type>+ with the type of e-mail notifications you wish to receive. Valid options are: NONE, BEGIN, END, FAIL, REQUEUE, ALL
\item \verb+<corecount>+ with the degree of multithreaded parallelism (i.e., cores) allocated to your job. Up to 32 by default.
\item \verb+<memory>+ with the amount of memory, in GB, that you want to be allocated per node. Up to 500 depending on the node.
NOTE: All jobs MUST set a value for the \verb|--mem| option.
\end{enumerate}

Example with short option equivalents:
\noindent Replace the following to adjust the job script for your project(s)
\begin{itemize}
\item \verb+<jobname>+ with a job name for the job. This name will be displayed in the job queue.
\item \verb+<directory>+ with the full path to your job's working directory, i.e., the directory that holds your code
and source files and where the standard output files will be written.
By default, \verb+--chdir+ sets the current directory as the job's working directory.
\item \verb+<type>+ with the type of e-mail notifications you wish to receive.
Valid options are: NONE, BEGIN, END, FAIL, REQUEUE, ALL.
\item \verb+<corecount>+ with the degree of multithreaded parallelism (i.e., cores) allocated to your job. Up to 32 by default.
\item \verb+<memory>+ with the amount of memory, in GB, that you want to be allocated per node. Up to 500 depending on the node.\\
\textbf{Note}: All jobs MUST set a value for the \option{--mem} option.
\end{itemize}

\noindent Example with short option equivalents:
\small
\begin{verbatim}
#SBATCH -J tmpdir ## Job's name set to 'tmpdir'
#SBATCH --mail-type=ALL ## Receive all email type notifications
#SBATCH -D ./ ## Use current directory as working directory
#SBATCH -N 1 ## Node count required for the job
#SBATCH -n 1 ## Number of tasks to be launched
#SBATCH -c 8 ## Request 8 cores
#SBATCH --mem=32G ## Allocate 32G memory per node
#SBATCH -J myjob ## Job's name set to 'myjob'
#SBATCH --mail-type=ALL ## Receive all email type notifications
#SBATCH -D ./ ## Use current directory as working directory
#SBATCH -N 1 ## Node count required for the job
#SBATCH -n 1 ## Number of tasks to be launched
#SBATCH -c 8 ## Request 8 cores
#SBATCH --mem=32G ## Allocate 32G memory per node
\end{verbatim}
\normalsize

%
If you are unsure about memory footprints, err on assigning a generous
\noindent \textbf{Tip:} If you are unsure about memory footprints, err on the side of assigning a generous
memory space to your job, so that it does not get prematurely terminated.
%(the value given to \api{h\_vmem} is a hard memory ceiling).
You can refine
%\api{h\_vmem}
\option{--mem}
values for future jobs by monitoring the size of a job's active
You can refine \option{--mem} values for future jobs by monitoring the size of a job's active
memory space on \texttt{speed-submit} with:

%\begin{verbatim}
%qstat -j <jobID> | grep maxvmem
%\end{verbatim}

\begin{verbatim}
sacct -j <jobID>
sstat -j <jobID>
sacct -j <jobID>
sstat -j <jobID>
\end{verbatim}

\noindent
This can be customized to show specific columns:
\noindent This can be customized to show specific columns:

\begin{verbatim}
sacct -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <jobID>
sstat -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <jobID>
sacct -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <jobID>
sstat -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <jobID>
\end{verbatim}

Memory-footprint values are also provided for completed jobs in the final
e-mail notification as ``maxvmsize''.
%
\noindent Memory-footprint efficiency values (\tool{seff}) are also provided for completed jobs in the final
email notification as ``maxvmsize''.
\emph{Jobs that request a low-memory footprint are more likely to load on a busy
cluster.}
cluster.}\\

Other essential options are \option{--time}, or \verb|-t|, and \option{--account}, or \verb|-A|.
%
\noindent Other essential options are \option{--time}, or \option{-t}, and \option{--account}, or \option{-A}.
\begin{itemize}
\item
\option{--time=<time>} -- is the estimate of wall clock time required for your job to run.
As preiviously mentioned, the maximum is 7 days for batch and 24 hours for interactive jobs.
Jobs with a smaller \texttt{time} value will have a higher priority and may result in your job being scheduled sooner.

\item
\option{--account=<name>} -- specifies which Account, aka project or association,
that the Speed resources your job uses should be attributed to. When moving from
GE to SLURM users most users were assigned to Speed's two default accounts
\texttt{speed1} and \texttt{speed2}. However, users that belong to a particular research
group or project are will have a default Account like the following
\texttt{aits},
\texttt{vidpro},
\texttt{gipsy},
\texttt{ai2},
\texttt{mpackir},
\texttt{cmos}, among others.
\item \option{--time=<time>} -- is the estimate of wall clock time required for your job to run.
As previously mentioned, the maximum is 7 days for batch and 24 hours for interactive jobs.
Jobs with a smaller \texttt{time} value have a higher priority and may be scheduled sooner.

\item \option{--account=<name>} -- specifies the Account, aka project or association,
to which the Speed resources your job uses should be attributed. When moving from
GE to SLURM, most users were assigned to Speed's two default accounts
\texttt{speed1} and \texttt{speed2}. However, users that belong to a particular research
group or project will have a default Account like the following
\texttt{aits},
\texttt{vidpro},
\texttt{gipsy},
\texttt{ai2},
\texttt{mpackir},
\texttt{cmos}, among others.
\end{itemize}
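
As a rough illustration of how the directives described in this file combine, here is a minimal job-script sketch. The one-hour `--time` value, the choice of the `speed1` account, and the `echo` payload are assumptions for illustration only; substitute the values that apply to your own project. `SLURM_CPUS_PER_TASK` and `SLURM_JOB_ID` are environment variables that Slurm sets inside a running job, so the script falls back to placeholder values when run outside the scheduler.

```shell
#!/encs/bin/bash
#SBATCH --job-name=myjob       ## name displayed in the job queue
#SBATCH --mail-type=ALL        ## receive all email notification types
#SBATCH --chdir=./             ## use the current directory as working directory
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G              ## all jobs must set --mem
#SBATCH --time=01:00:00        ## assumed estimate: 1 hour of wall clock time
#SBATCH --account=speed1       ## assumption: one of the two default accounts

# Slurm exports SLURM_CPUS_PER_TASK when --cpus-per-task is set;
# fall back to 1 so the script also runs outside the scheduler.
threads="${SLURM_CPUS_PER_TASK:-1}"

# Placeholder payload: replace with the actual commands of your job.
echo "Job ${SLURM_JOB_ID:-not-under-slurm} running with ${threads} thread(s)"
```

Such a script would typically be submitted with `sbatch <scriptname>`. Because `#SBATCH` lines are ordinary shell comments, the same file can also be run directly with `bash` for a quick syntax check.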
50 changes: 29 additions & 21 deletions doc/scheduler-env.tex
@@ -1,14 +1,18 @@
% ------------------------------------------------------------------------------
\subsubsection{Environment Set Up}
\label{sect:envsetup}
% 2.1.2 Environment Set Up
% --------------------------
% TMP scheduler-specific section

After creating an SSH connection to Speed, you will need to
make sure the \tool{srun}, \tool{sbatch}, and \tool{salloc}
commands are available to you.
Type the command name at the command prompt and press enter.
If the command is not available, e.g., (``command not found'') is returned,
you need to make sure your \api{\$PATH} has \texttt{/local/bin} in it.
To view your \api{\$PATH} type \texttt{echo \$PATH} at the prompt.
commands are available to you.
To check this, type each command at the prompt and press Enter.
If ``command not found'' is returned, you need to make sure your \api{\$PATH}
includes \texttt{/local/bin}.
You can check your \api{\$PATH} by typing:
\begin{verbatim}
echo $PATH
\end{verbatim}

%
%source
%the ``Altair Grid Engine (AGE)'' scheduler's settings file.
@@ -38,20 +42,24 @@ \subsubsection{Environment Set Up}
%\texttt{qstat -f -u "*"}. If an error is returned, attempt sourcing
%the settings file again.

The next step is to copy a job template to your home directory and to set up your
cluster-specific storage. Execute the following command from within your
home directory. (To move to your home directory, type \texttt{cd} at the Linux
prompt and press \texttt{Enter}.)

\noindent The next step is to set up your cluster-specific storage, ``speed-scratch''. To do so, execute the following command from within your
home directory:
\begin{verbatim}
cp /home/n/nul-uge/template.sh . && mkdir /speed-scratch/$USER
mkdir -p /speed-scratch/$USER && cd /speed-scratch/$USER
\end{verbatim}

%\textbf{Tip:} Add the source command to your shell-startup script.
\noindent Next, copy a job template to your cluster-specific storage:
\begin{itemize}
\item From Windows drive G: to Speed:\\
\verb|cp /winhome/<1st letter of $USER>/$USER/example.sh /speed-scratch/$USER/|
\item From Linux drive U: to Speed:\\
\verb|cp ~/example.sh /speed-scratch/$USER/|
\end{itemize}

\noindent \textbf{Tip:} the default shell for GCS ENCS users is \tool{tcsh}.
If you would like to use \tool{bash}, please contact \texttt{rt-ex-hpc AT encs.concordia.ca}.\\

\textbf{Tip:} the default shell for GCS ENCS users is \tool{tcsh}.
If you would like to use \tool{bash}, please contact
\texttt{rt-ex-hpc AT encs.concordia.ca}.
%\textbf{Tip:} Add the source command to your shell-startup script.

%For \textbf{new GCS ENCS Users}, and/or those who don't have a shell-startup script,
%based on your shell type use one of the following commands to copy a start up script
@@ -91,6 +99,6 @@ \subsubsection{Environment Set Up}
%fi
%\end{verbatim}

\textbf{Note:} If a ``command not found'' error appears after you log in to speed,
your user account many have probably have defunct Grid Engine environment commands.
See \xa{appdx:uge-to-slurm} to learn how to prevent this error on login.
\noindent \textbf{Note:} If you encounter a ``command not found'' error after logging in to Speed,
your user account may have defunct Grid Engine environment commands.
See \xa{appdx:uge-to-slurm} for instructions on how to resolve this issue.
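
The availability check described in this section can also be scripted rather than typed command by command. The sketch below is one possible way to do it; the helper name `missing_tools` is hypothetical and not something provided on Speed. It relies only on the shell built-in `command -v`.

```shell
# Hypothetical helper: print the name of every given command
# that cannot be found on the current PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools srun sbatch salloc)"
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Not found on PATH:" $missing
    echo "Check that PATH includes /local/bin: $PATH"
else
    echo "srun, sbatch, and salloc are available"
fi
```

The function uses `local`, so it assumes `bash`. Since the default shell for GCS ENCS users is `tcsh`, save the snippet to a file and run it with `bash <file>` rather than pasting it at a `tcsh` prompt.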