diff --git a/doc/images/vscode.png b/doc/images/vscode.png
new file mode 100644
index 0000000..067b9bd
Binary files /dev/null and b/doc/images/vscode.png differ
diff --git a/doc/scheduler-scripting.tex b/doc/scheduler-scripting.tex
index d695d89..99f9f33 100644
--- a/doc/scheduler-scripting.tex
+++ b/doc/scheduler-scripting.tex
@@ -534,7 +534,11 @@ \subsubsection{Graphical Applications}
 Once landed on a compute node, verify \api{DISPLAY} again.
 \item
-While running under scheduler, unset \api{XDG\_RUNTIME\_DIR}.
+While running under the scheduler, create a run directory and point the \api{XDG\_RUNTIME\_DIR} variable at it:
+\begin{verbatim}
+mkdir -p /speed-scratch/$USER/run-dir
+setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
+\end{verbatim}
 \item
 Launch your graphical application:
@@ -545,7 +549,9 @@ \subsubsection{Graphical Applications}
 Here's an example of starting PyCharm (see \xf{fig:pycharm}), of which we made a sample local
 installation. You can make a similar install under your own directory. If using VSCode, it's
-currently only supported with the \tool{--no-sandbox} option.
+currently only supported with the \tool{--no-sandbox} option. \newline
+
+BASH version:
 \scriptsize
 \begin{verbatim}
@@ -553,16 +559,36 @@ \subsubsection{Graphical Applications}
 serguei@speed's password:
 [serguei@speed-submit ~] % echo $DISPLAY
 localhost:14.0
-[serguei@speed-submit ~] % srun -p ps --pty --x11=first --mem 4000 -t 0-06:00 /encs/bin/bash
+[serguei@speed-submit ~] % salloc -p ps --x11=first --mem=4Gb -t 0-06:00
 bash-4.4$ echo $DISPLAY
 localhost:77.0
 bash-4.4$ hostname
 speed-01.encs.concordia.ca
-bash-4.4$ unset XDG_RUNTIME_DIR
+bash-4.4$ export XDG_RUNTIME_DIR=/speed-scratch/$USER/run-dir
 bash-4.4$ /speed-scratch/nag-public/bin/pycharm.sh
 \end{verbatim}
 \normalsize
+TCSH version:
+\scriptsize
+\begin{verbatim}
+ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
+[speed-submit] [/home/c/carlos] > echo $DISPLAY
+localhost:14.0
+[speed-submit] [/home/c/carlos] > cd /speed-scratch/$USER
+[speed-submit] [/speed-scratch/carlos] > echo $DISPLAY
+localhost:13.0
+[speed-submit] [/speed-scratch/carlos] > salloc -pps --x11=first --mem=4Gb -t 0-06:00
+[speed-07] [/speed-scratch/carlos] > echo $DISPLAY
+localhost:42.0
+[speed-07] [/speed-scratch/carlos] > hostname
+speed-07.encs.concordia.ca
+[speed-07] [/speed-scratch/carlos] > setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
+[speed-07] [/speed-scratch/carlos] > /speed-scratch/nag-public/bin/pycharm.sh
+\end{verbatim}
+\normalsize
+
+
 \begin{figure}[htpb]
   \includegraphics[width=\columnwidth]{images/pycharm}
   \caption{PyCharm Starting up on a Speed Node}
@@ -645,6 +671,65 @@ \subsubsection{Jupyter Notebooks}
 \label{fig:jupyter}
 \end{figure}
 
+% ------------------------------------------------------------------------------
+\subsubsection{VSCode}
+\label{sect:vscode}
+
+This is an example of running VSCode; it is similar to Jupyter notebooks, but it does not use containers.
+This is the Web version; a local (workstation) to remote (Speed node) version also exists, but it is for
+advanced users only (no support; run it at your own risk).
+
+\begin{itemize}
+\item
+Environment preparation (first time only):
+\begin{enumerate}
+\item
+Go to your speed-scratch directory: \texttt{cd /speed-scratch/\$USER}
+\item
+Create a vscode directory: \texttt{mkdir vscode}
+\item
+Go to vscode: \texttt{cd vscode}
+\item
+Create home and projects: \texttt{mkdir \{home,projects\}}
+\item
+Create a run-user directory: \texttt{mkdir -p /speed-scratch/\$USER/run-user}
+\end{enumerate}
+\item
+Running VSCode:
+\begin{enumerate}
+\item
+Go to your vscode directory: \texttt{cd /speed-scratch/\$USER/vscode}
+\item
+Open an interactive session: \texttt{salloc --mem=10Gb --constraint=el9}
+\item
+Set the environment variable: \texttt{setenv XDG\_RUNTIME\_DIR /speed-scratch/\$USER/run-user}
+\item
+Run VSCode, changing the port if needed:
+\scriptsize
+\begin{verbatim}
+/speed-scratch/nag-public/code-server-4.22.1/bin/code-server --user-data-dir=$PWD/projects \
+--config=$PWD/home/.config/code-server/config.yaml --bind-addr="0.0.0.0:8080" $PWD/projects
+\end{verbatim}
+\normalsize
+\item
+Create an SSH tunnel, as for Jupyter; see \xs{sect:jupyter}.
+\item
+Open a browser and go to \texttt{localhost:8080}.
+\item
+If the browser asks for a password, retrieve it with:
+\begin{verbatim}
+cat /speed-scratch/$USER/vscode/home/.config/code-server/config.yaml
+\end{verbatim}
+
+\end{enumerate}
+\end{itemize}
+
+\begin{figure}[htbp]
+    \centering
+    \fbox{\includegraphics[width=1.00\textwidth]{images/vscode.png}}
+  \caption{VSCode running on a Speed node}
+  \label{fig:vscode}
+\end{figure}
+
 % ------------------------------------------------------------------------------
 \subsection{Scheduler Environment Variables}
 \label{sect:env-vars}
diff --git a/doc/speed-manual.pdf b/doc/speed-manual.pdf
index 6660467..968017e 100644
Binary files a/doc/speed-manual.pdf and b/doc/speed-manual.pdf differ
diff --git a/doc/web/images/vscode.png b/doc/web/images/vscode.png
new file mode 100644
index 0000000..067b9bd
Binary files /dev/null and b/doc/web/images/vscode.png differ
diff --git a/doc/web/index.html b/doc/web/index.html
index ee80e67..99fe53e 100644
--- a/doc/web/index.html
+++ b/doc/web/index.html
@@ -68,52 +68,53 @@
While running under the scheduler, create a run directory and set the variable + XDG_RUNTIME_DIR to point to it. + + + +
++ mkdir -p /speed-scratch/$USER/run-dir + setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir ++
+
Launch your graphical application: -
module load the required version, then matlab, or abaqus cme, etc.
Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local +
Launch your graphical application: +
module load the required version, then matlab, or abaqus cme, etc.
+ Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local
installation. You can make a similar install under your own directory. If using VSCode, it’s currently
-only supported with the --no-sandbox option.
+only supported with the --no-sandbox option.
+
BASH version:
-+bash-3.2$ ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too) serguei@speed’s password: [serguei@speed-submit ~] % echo $DISPLAY localhost:14.0 -[serguei@speed-submit ~] % srun -p ps --pty --x11=first --mem 4000 -t 0-06:00 /encs/bin/bash +[serguei@speed-submit ~] % salloc -p ps --x11=first --mem=4Gb -t 0-06:00 bash-4.4$ echo $DISPLAY localhost:77.0 bash-4.4$ hostname speed-01.encs.concordia.ca -bash-4.4$ unset XDG_RUNTIME_DIR +bash-4.4$ export XDG_RUNTIME_DIR=/speed-scratch/$USER/run-dir bash-4.4$ /speed-scratch/nag-public/bin/pycharm.sh-+
+
TCSH version: + + + +
++ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too) +[speed-submit] [/home/c/carlos] > echo $DISPLAY +localhost:14.0 +[speed-submit] [/home/c/carlos] > cd /speed-scratch/$USER +[speed-submit] [/speed-scratch/carlos] > echo $DISPLAY +localhost:13.0 +[speed-submit] [/speed-scratch/carlos] > salloc -pps --x11=first --mem=4Gb -t 0-06:00 +[speed-07] [/speed-scratch/carlos] > echo $DISPLAY +localhost:42.0 +[speed-07] [/speed-scratch/carlos] > hostname +speed-07.encs.concordia.ca +[speed-07] [/speed-scratch/carlos] > setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir +[speed-07] [/speed-scratch/carlos] > /speed-scratch/nag-public/bin/pycharm.sh ++-
2.9 Scheduler Environment Variables
-The scheduler presents a number of environment variables that can be used in your jobs. You can +
2.8.4 VSCode
+This is an example of running VSCode; it is similar to Jupyter notebooks, but it does not use containers.
+This is the Web version; a local (workstation) to remote (Speed node) version also exists, but it is for
+advanced users only (no support; run it at your own risk).
+
+
Environment preparation (first time only): +
Running VSCode: +
Run VSCode, changing the port if needed: + + + +
+
+
/speed-scratch/nag-public/code-server-4.22.1/bin/code-server --user-data-dir=$PWD/projects \
--config=$PWD/home/.config/code-server/config.yaml --bind-addr="0.0.0.0:8080" $PWD/projects
+
+
+
If the browser asks for a password, retrieve it with: + + + +
++ cat /speed-scratch/$USER/vscode/home/.config/code-server/config.yaml ++
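For the SSH tunnel step, a minimal sketch (assuming code-server is listening on port 8080 and the job landed on speed-07; substitute your own node, port, and ENCS username):

# On your workstation: forward local port 8080 to port 8080 on the compute node,
# hopping through the Speed login node.
ssh -L 8080:speed-07.encs.concordia.ca:8080 <ENCS-username>@speed.encs.concordia.ca
# Then browse to http://localhost:8080 on the workstation.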
+
+The scheduler presents a number of environment variables that can be used in your jobs. You can invoke env or printenv in your job to know what those are (most begin with the prefix SLURM). Some of the more useful ones are:
@@ -1155,48 +1265,48 @@See a more complete list here: +
See a more complete list here:
In Figure 8 is a sample script, using some of these. +
In Figure 9 is a sample script, using some of these.
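As a minimal illustration (this is not the Figure 9 script; resource values are arbitrary), a job script can simply echo a few of these variables:

#!/encs/bin/bash
#SBATCH --job-name=envvar-demo
#SBATCH --mem=1G
#SBATCH --ntasks=1

# Report a few scheduler-provided variables in the job's output file.
echo "Job ID:           $SLURM_JOB_ID"
echo "Submit directory: $SLURM_SUBMIT_DIR"
echo "Node list:        $SLURM_JOB_NODELIST"
echo "Task count:       $SLURM_NTASKS"
echo "Per-job temp dir: $TMPDIR"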
-Some programs effect their parallel processing via MPI (which is a communication protocol). An example of such software is Fluent. MPI needs to have ‘passwordless login’ set up, which means SSH keys. In your NFS-mounted home directory: @@ -1214,7 +1324,7 @@
-
The following documentation is specific to the Speed HPC Facility at the Gina Cody School of Engineering and Computer Science. Virtual environments are typically instantiated via Conda or Python. Another option is Singularity, detailed in Section 2.16. Usually, virtual environments are created once @@ -1223,7 +1333,7 @@
Request an interactive session in the queue you wish to submit your jobs to (e.g., salloc -p pg --gpus=1 for GPU jobs). Once your interactive session has started, create an anaconda environment in your speed-scratch directory by using the prefix option when executing conda create. For example, @@ -1236,7 +1346,7 @@
+module load anaconda3/2023.03/default conda create --prefix /speed-scratch/a_user/myconda@@ -1244,13 +1354,13 @@2.11.1
Note: Without the prefix option, the conda create command creates the environment in a_user’s home directory by default.
-List Environments. To view your conda environments, type: conda info --envs
-+# conda environments: # base * /encs/pkg/anaconda3-2023.03/root @@ -1258,13 +1368,13 @@2.11.1
-
Activate an Environment. Activate the environment /speed-scratch/a_user/myconda as follows
-+conda activate /speed-scratch/a_user/mycondaAfter activating your environment, add pip to your environment by using @@ -1272,7 +1382,7 @@
2.11.1 -
+conda install pipThis will install pip and pip’s dependencies, including python, into the environment. @@ -1284,7 +1394,7 @@
2.11.1 -
+salloc -p pg --gpus=1 --mem=10GB -A <slurm account name> cd /speed-scratch/$USER module load python/3.11.0/default @@ -1304,7 +1414,7 @@2.11.1 conda install installs modules from anaconda’s repository.
-
2.11.2 Python
+2.11.2 Python
Setting up a Python virtual environment is fairly straightforward. The first step is to request an interactive session in the queue you wish to submit your jobs to.
We have a simple example that uses a Python virtual environment: @@ -1316,7 +1426,7 @@
2.11.2 -
+salloc -p pg --gpus=1 --mem=10GB -A <slurm account name> cd /speed-scratch/$USER module load python/3.9.1/default @@ -1336,56 +1446,56 @@2.11.2 --gpus= when preparing environments for CPU jobs.
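The remaining virtual-environment commands are cut off by this hunk; a minimal bash sketch of the usual pattern (paths and the package name are illustrative):

# bash; run inside the salloc session shown above
export TMPDIR=/speed-scratch/$USER/tmp            # keep pip's temporary files off /tmp
mkdir -p $TMPDIR
python -m venv /speed-scratch/$USER/venvs/myenv   # create the environment on speed-scratch
source /speed-scratch/$USER/venvs/myenv/bin/activate
pip install numpy                                 # example package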
-
2.12 Example Job Script: Fluent
+2.12 Example Job Script: Fluent
-The job script in Figure 9 runs Fluent in parallel over 32 cores. Of note, we have requested +
The job script in Figure 10 runs Fluent in parallel over 32 cores. Of note, we have requested e-mail notifications (--mail-type), are defining the parallel environment for fluent with -t$SLURM_NTASKS and -g -cnf=$FLUENTNODES (very important), and are setting $TMPDIR as the in-job location for the “moment” rfile.out file (in-job, because the last line of the @@ -1395,7 +1505,7 @@
Caveat: take care with journal-file file paths. -
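The Figure 10 script itself is not reproduced in this hunk; a hedged sketch of the pattern described above (module name, journal-file path, and resource values are illustrative, not the actual Speed sample):

#!/encs/bin/bash
#SBATCH --job-name=fluent-demo
#SBATCH --ntasks=32
#SBATCH --mem=160G
#SBATCH --mail-type=ALL

module load fluent                          # exact module name/version may differ
cd $TMPDIR                                  # per-job temporary directory, as noted above
srun hostname -s > machinefile              # node list for -cnf=
export FLUENTNODES=$PWD/machinefile
fluent 3ddp -g -t$SLURM_NTASKS -cnf=$FLUENTNODES -i /path/to/my-journal.jou > call.txt 2>&1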
2.13 Example Job: efficientdet
+2.13 Example Job: efficientdet
The following steps, describing how to create an efficientdet environment on Speed, were submitted by a member of Dr. Amer’s research group.
@@ -1416,7 +1526,7 @@+
pip install tensorflow==2.7.0 pip install lxml>=4.6.1 pip install absl-py>=0.10.0 @@ -1435,7 +1545,7 @@
-
2.14 Java Jobs
+2.14 Java Jobs
Jobs that call java have a memory overhead, which needs to be taken into account when assigning a value to --mem. Even the most basic java call, java -Xmx1G -version, will need --mem=5G, with the 4-GB difference representing the memory overhead. Note that this memory @@ -1444,7 +1554,7 @@
2.14 314G.
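A minimal sketch of that rule of thumb in a job script (the module name is illustrative; load the Java version you actually need):

#!/encs/bin/bash
#SBATCH --job-name=java-demo
#SBATCH --mem=5G      # 1G of heap (-Xmx1G) plus roughly 4G of JVM overhead, per the note above

module load java      # illustrative module name
java -Xmx1G -version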
-
2.15 Scheduling On The GPU Nodes
+2.15 Scheduling On The GPU Nodes
The primary cluster has two GPU nodes, each with six Tesla (CUDA-compatible) P6 cards: each card has 2048 cores and 16GB of RAM. Though note that the P6 is mainly a single-precision card, so unless you need the GPU double precision, double-precision calculations will be faster on a CPU @@ -1455,7 +1565,7 @@
+
#SBATCH --gpus=[1|2]@@ -1465,7 +1575,7 @@
+
sbatch -p pg ./<myscript>.sh@@ -1474,7 +1584,7 @@
+
ssh <username>@speed[-05|-17|37-43] nvidia-smi@@ -1483,7 +1593,7 @@
+
sinfo -p pg --long --Node@@ -1499,7 +1609,7 @@
+
[serguei@speed-submit src] % sinfo -p pg --long --Node Thu Oct 19 22:31:04 2023 NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON @@ -1527,7 +1637,7 @@+
[serguei@speed-submit src] % squeue -p pg -o "%15N %.6D %7P %.11T %.4c %.8z %.6m %.8d %.6w %.8f %20G %20E" NODELIST NODES PARTITI STATE MIN_ S:C:T MIN_ME MIN_TMP_ WCKEY FEATURES GROUP DEPENDENCY speed-05 1 pg RUNNING 1 *:*:* 1G 0 (null) (null) 11929 (null) @@ -1540,7 +1650,7 @@
-
2.15.1 P6 on Multi-GPU, Multi-Node
+2.15.1 P6 on Multi-GPU, Multi-Node
As described above, P6 cards are not compatible with Distribute and DataParallel functions (PyTorch, TensorFlow) when running on multiple GPUs. One workaround is to run the job multi-node, with a single GPU per node; for example: @@ -1548,7 +1658,7 @@
+
#SBATCH --nodes=2 #SBATCH --gpus-per-node=1@@ -1558,7 +1668,7 @@-
2.15.2 CUDA
+2.15.2 CUDA
When calling CUDA within job scripts, it is important to create a link to the desired CUDA libraries and set the runtime link path to the same libraries. For example, to use the cuda-11.5 libraries, specify the following in your Makefile. @@ -1566,7 +1676,7 @@
2.15.2 -
or module load gcc/9.3+-L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64@@ -1574,14 +1684,14 @@
2.15.2 load gcc/8.4
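As a minimal command-line sketch of the same link flags (the file names and the -lcudart library are illustrative; the CUDA path comes from the Makefile example above):

module load gcc/9.3
gcc demo.o -o demo \
    -L/encs/pkg/cuda-11.5/root/lib64 \
    -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64 -lcudart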
-
2.15.3 Special Notes for sending CUDA jobs to the GPU Queue
+2.15.3 Special Notes for sending CUDA jobs to the GPU Queue
Interactive jobs (Section 2.8) must be submitted to the GPU partition in order to compile and link. We have several versions of CUDA installed in:
-+with one of the above./encs/pkg/cuda-11.5/root/ /encs/pkg/cuda-10.2/root/ /encs/pkg/cuda-9.2/root @@ -1591,15 +1701,15 @@usrlocalcuda
-
2.15.4 OpenISS Examples
+2.15.4 OpenISS Examples
These represent more comprehensive, research-like examples of jobs for computer vision and other tasks with much longer runtimes (subject to the number of epochs and other parameters), derived from the actual research work of students and their theses. These jobs require the use of CUDA and GPUs. These examples are available as “native” jobs on Speed and as Singularity containers.
-OpenISS and REID + The example openiss-reid-speed.sh illustrates a job for computer-vision-based person re-identification (e.g., motion-capture-based tracking for stage performance), part of the OpenISS project by Haotao Lai [10], using TensorFlow and Keras. The fork of the original repo [12] adjusted to @@ -1614,8 +1724,8 @@
2.15 -
OpenISS and YOLOv3 + The related code, using the YOLOv3 framework, is in the fork of the original repo [11] adjusted to run on Speed, here:
@@ -1636,7 +1746,7 @@2.15
https://github.com/NAG-DevOps/speed-hpc/tree/master/src#openiss-yolov3 -
2.16 Singularity Containers
+2.16 Singularity Containers
If the /encs software tree does not have the software you require readily available, another option is to run Singularity containers. We run the EL7 flavor of Linux, and if some projects require Ubuntu or other distributions, it is possible to run that software as a container, including the ones @@ -1668,7 +1778,7 @@
2 -
+/speed-scratch/nag-public: openiss-cuda-conda-jupyter.sif @@ -1710,7 +1820,7 @@2 -
+salloc --gpus=1 -n8 --mem=4Gb -t60 cd /speed-scratch/$USER/ singularity pull openiss-cuda-devicequery.sif docker://openiss/openiss-cuda-devicequery @@ -1723,14 +1833,14 @@2 example.
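After the pull completes, a hedged sketch of actually running the image (--nv enables GPU access and -B binds your scratch directory; both are standard Singularity flags):

singularity run --nv -B /speed-scratch/$USER:/speed-scratch/$USER \
    openiss-cuda-devicequery.sif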
-
3 Conclusion
+3 Conclusion
The cluster is “first come, first served” until it fills; after that, job position in the queue is based upon past usage. The scheduler does attempt to fill gaps, though, so sometimes a single-core job of lower priority will schedule before a multi-core job of higher priority, for example.
-
3.1 Important Limitations
+3.1 Important Limitations
- New users are restricted to a total of 32 cores: write to rt-ex-hpc@encs.concordia.ca if you need more temporarily (192 is the maximum, i.e., 6 jobs of 32 cores each).
3.
-
3.2 Tips/Tricks
+3.2 Tips/Tricks
- Files/scripts must have Linux line breaks in them (not Windows ones). Use file command to verify; and dos2unix command to convert. @@ -1784,7 +1894,7 @@
3.2
- E-mail, rt-ex-hpc AT encs.concordia.ca, with any concerns/questions.
-
3.3 Use Cases
+3.3 Use Cases
HPC Committee’s initial batch of about 6 students (end of 2019):
@@ -1863,10 +1973,10 @@3.3
-
A History
+A History
-
A.1 Acknowledgments
+A.1 Acknowledgments
- The first 6 (to 6.5) versions of this manual and early UGE job script samples, Singularity testing and user support were produced/done by Dr. Scott Bunnell during his time at @@ -1876,14 +1986,14 @@
A.1
- Dr. Tariq Daradkeh, was our IT Instructional Specialist August 2022 to September 2023; working on the scheduler, scheduling research, end user support, and integration - of examples, such as YOLOv3 in Section 2.15.4.0 other tasks. We have a continued + of examples, such as YOLOv3 in Section 2.15.4.0 other tasks. We have a continued collaboration on HPC/scheduling research.
-
A.2 Migration from UGE to SLURM
+A.2 Migration from UGE to SLURM
For long-term users who started off with Grid Engine, here are some resources to help with the transition and the mapping of the job submission process.
@@ -1895,7 +2005,7 @@+
GE => SLURM s.q ps g.q pg @@ -1905,12 +2015,12 @@-
Commands and command options mappings are found in Figure 10 from
https://slurm.schedmd.com/rosetta.pdf
https://slurm.schedmd.com/pdfs/summary.pdf
Other related helpful resources from similar organizations who either used SLURM for awhile or +Commands and command options mappings are found in Figure 11 from
https://slurm.schedmd.com/rosetta.pdf
https://slurm.schedmd.com/pdfs/summary.pdf
Other related helpful resources from similar organizations who either used SLURM for awhile or also transitioned to it:
https://docs.alliancecan.ca/wiki/Running_jobs
https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf
https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm@@ -1922,7 +2032,7 @@ +
# Speed environment set up if ($HOSTNAME == speed-submit.encs.concordia.ca) then source /local/pkg/uge-8.6.3/root/default/common/settings.csh @@ -1934,7 +2044,7 @@+
# Speed environment set up if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then . /local/pkg/uge-8.6.3/root/default/common/settings.sh @@ -1948,21 +2058,21 @@-
A.3 Phases
+A.3 Phases
Brief summary of Speed evolution phases.
-
A.3.1 Phase 4
+A.3.1 Phase 4
Phase 4 had 7 SuperMicro servers with 4x A100 80GB GPUs each added, dubbed as “SPEED2”. We also moved from Grid Engine to SLURM.
-
A.3.2 Phase 3
+A.3.2 Phase 3
Phase 3 added 4 vidpro nodes from Dr. Amer, totalling 6x P6 and 6x V100 GPUs.
-
A.3.3 Phase 2
+A.3.3 Phase 2
Phase 2 saw 6x NVIDIA Tesla P6 added and 8x more compute nodes. The P6s replaced 4x of FirePro S7150. @@ -1970,7 +2080,7 @@
A.3.3
-
A.3.4 Phase 1
+A.3.4 Phase 1
Phase 1 of Speed was of the following configuration:
@@ -1981,20 +2091,20 @@
A.3.4
-
B Frequently Asked Questions
+B Frequently Asked Questions
-
B.1 Where do I learn about Linux?
+B.1 Where do I learn about Linux?
All Speed users are expected to have a basic understanding of Linux and its commonly used commands.
-
Software Carpentry
+Software Carpentry
Software Carpentry provides free resources to learn software, including a workshop on the Unix shell. https://software-carpentry.org/lessons/
-
Udemy
+Udemy
There are a number of Udemy courses, including free ones, that will assist you in learning Linux. Active Concordia faculty, staff and students have access to Udemy courses. The course Linux Mastery: Master the Linux Command Line in 11.5 Hours is a good starting point for @@ -2005,25 +2115,25 @@
Udemy
-
B.2 How to use the “bash shell” on Speed?
+B.2 How to use the “bash shell” on Speed?
This section describes how to use the “bash shell” on Speed. Review Section 2.1.2 to ensure that your bash environment is set up.
-
B.2.1 How do I set bash as my login shell?
+B.2.1 How do I set bash as my login shell?
In order to set your default login shell to bash on Speed, your login shell on all GCS servers must be changed to bash. To make this change, create a ticket with the Service Desk (or email help at concordia.ca) to request that bash become your default login shell for your ENCS user account on all GCS servers.
-
B.2.2 How do I move into a bash shell on Speed?
+B.2.2 How do I move into a bash shell on Speed?
To move to the bash shell, type bash at the command prompt. For example:
-+after entering the bash shell.[speed-submit] [/home/a/a_user] > bash bash-4.4$ echo $0 bash @@ -2033,7 +2143,7 @@
bash-4.4$ -
B.2.3 How do I use the bash shell in an interactive session on Speed?
+B.2.3 How do I use the bash shell in an interactive session on Speed?
Below are examples of how to use bash as a shell in your interactive job sessions with both the salloc and srun commands.
@@ -2043,41 +2153,41 @@srun --mem=50G -n 5 --pty /encs/bin/bash
Note: Make sure the interactive job requests memory, cores, etc.
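The salloc variant is cut off by this hunk; a minimal sketch of both forms (resource values are illustrative):

# Request an allocation, then start bash at the resulting compute-node prompt
salloc --mem=50G -n 5
bash

# Or run bash directly as the interactive task
srun --mem=50G -n 5 --pty /encs/bin/bash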
-B.2.4 How do I run scripts written in bash on Speed?
+B.2.4 How do I run scripts written in bash on Speed?
To execute bash scripts on Speed:
-
+- Ensure that the shebang of your bash job script is #!/encs/bin/bash +
- Ensure that the shebang of your bash job script is #!/encs/bin/bash
-- Use the sbatch command to submit your job script to the scheduler.
Use the sbatch command to submit your job script to the scheduler. The Speed GitHub contains a sample bash job script.
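A minimal sketch of such a script and its submission (this is not the GitHub sample; names and values are illustrative):

#!/encs/bin/bash
#SBATCH --job-name=bash-demo
#SBATCH --mem=1G
#SBATCH --ntasks=1

echo "Running on $(hostname) as $USER"

Submit it with sbatch ./bash-demo.sh.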
-
B.3 How to resolve “Disk quota exceeded” errors?
+B.3 How to resolve “Disk quota exceeded” errors?
-
B.3.1 Probable Cause
+B.3.1 Probable Cause
The “Disk quota exceeded” error occurs when your application has run out of disk space to write to. On Speed, this error can be returned when:
-
- Your NFS-provided home is full and cannot be written to. You can verify this using quota +
- Your NFS-provided home is full and cannot be written to. You can verify this using quota and bigfiles commands.
-- The /tmp directory on the speed node your application is running on is full and cannot +
- The /tmp directory on the speed node your application is running on is full and cannot be written to.
-
B.3.2 Possible Solutions
+B.3.2 Possible Solutions
-
- Use the --chdir job script option to set the directory that the job script is submitted +
- Use the --chdir job script option to change the job working directory from the directory the job script was submitted from. The job working directory is the directory that the job will write output files in.
-- +
The use of local disk space is generally recommended for I/O-intensive operations. However, as the size of /tmp on speed nodes is 1TB, it can be necessary for scripts to store temporary data elsewhere. Review the documentation for each module called within your script to determine @@ -2098,7 +2208,7 @@
B. -
+mkdir -m 750 /speed-scratch/$USER/output@@ -2110,7 +2220,7 @@B. -
+mkdir -m 750 /speed-scratch/$USER/recovery@@ -2121,7 +2231,7 @@
B.
In the above example, $USER is an environment variable containing your ENCS username.
-
B.3.3 Example of setting working directories for COMSOL
+B.3.3 Example of setting working directories for COMSOL
Create directories for recovery, temporary, and configuration files. For example, to create these @@ -2130,7 +2240,7 @@
+
mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}@@ -2142,7 +2252,7 @@
+
-recoverydir /speed-scratch/$USER/comsol/recovery -tmpdir /speed-scratch/$USER/comsol/tmp -configuration/speed-scratch/$USER/comsol/config @@ -2151,7 +2261,7 @@In the above example, $USER is an environment variable containing your ENCS username.
-
B.3.4 Example of setting working directories for Python Modules
+B.3.4 Example of setting working directories for Python Modules
By default, when adding a python module, the /tmp directory is set as the temporary repository for file downloads. The size of the /tmp directory on speed-submit is too small for pytorch. To add a python module
@@ -2162,7 +2272,7 @@+
mkdir /speed-scratch/$USER/tmp@@ -2173,7 +2283,7 @@
+
setenv TMPDIR /speed-scratch/$USER/tmp@@ -2182,17 +2292,17 @@
In the above example, $USER is an environment variable containing your ENCS username.
-
B.4 How do I check my job’s status?
+B.4 How do I check my job’s status?
When a job with a job id of 1234 is running or terminated, the status of that job can be tracked using ‘sacct -j 1234’. squeue -j 1234 can show the job while it is sitting in the queue as well. Long-term statistics on the job after it has terminated can be found using sstat -j 1234 after slurmctld purges its tracking state into the database.
-
B.5 Why is my job pending when nodes are empty?
+B.5 Why is my job pending when nodes are empty?
-
B.5.1 Disabled nodes
+B.5.1 Disabled nodes
It is possible that one or a number of the Speed nodes are disabled. Nodes are disabled if they require maintenance. To verify if Speed nodes are disabled, see if they are in a draining or drained state: @@ -2200,7 +2310,7 @@
B.5.1 -
+[serguei@speed-submit src] % sinfo --long --Node Thu Oct 19 21:25:12 2023 NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON @@ -2249,7 +2359,7 @@B.5.1 and the disabled nodes have a state of idle.
-
B.5.2 Error in job submit request.
+B.5.2 Error in job submit request.
It is possible that your job is pending because it requested resources that are not available within Speed. To verify why job id 1234 is not running, execute ‘sacct -j 1234’. A summary of the reasons is available via the squeue command.
C Sister Facilities +
C Sister Facilities
Below is a list of resources and facilities similar to Speed at various capacities. Depending on your research group and needs, they might be available to you. They are not managed by HPC/NAG of AITS, so contact their respective representatives. @@ -2313,8 +2423,8 @@
C -
References
+References