% -----------------------------------------------------------------------------
% B Frequently Asked Questions
% -----------------------------------------------------------------------------
\section{Frequently Asked Questions}
\label{sect:faqs}
% B.1 Where do I learn about Linux?
% -----------------------------------------------------------------------------
\subsection{Where do I learn about Linux?}
\label{sect:faqs-linux}
All Speed users are expected to have a basic understanding of Linux and its commonly used commands.
Here are some recommended resources:
% -----------------------------------------------------------------------------
\paragraph*{Software Carpentry}
Software Carpentry provides free resources to learn software, including a workshop on the Unix shell.
Visit \href{https://software-carpentry.org/lessons/}{Software Carpentry Lessons} to learn more.
% -----------------------------------------------------------------------------
\paragraph*{Udemy}
There are numerous Udemy courses, including free ones, that will help you learn Linux.
Active Concordia faculty, staff and students have access to Udemy courses.
A recommended starting point for beginners is the course ``Linux Mastery: Master the Linux Command Line in 11.5 Hours''.
Visit \href{https://www.concordia.ca/it/services/udemy.html}{Concordia's Udemy page} to learn how Concordians can access Udemy.
% B.2 How to use bash shell on Speed?
% -----------------------------------------------------------------------------
\subsection{How to use bash shell on Speed?}
\label{sect:faqs-bash}
This section provides comprehensive instructions on how to utilize the bash shell on the Speed cluster.
% B.2.1 How do I set bash as my login shell?
% -----------------------------------------------------------------------------
\subsubsection{How do I set bash as my login shell?}
To set your default login shell to bash on Speed, your login shell on all GCS servers must be changed to bash.
To make this change, create a ticket with the Service Desk (or email \texttt{help at concordia.ca}) to
request that bash become your default login shell for your ENCS user account on all GCS servers.
% B.2.2 How do I move into a bash shell on Speed?
% -----------------------------------------------------------------------------
\subsubsection{How do I move into a bash shell on Speed?}
To move to the bash shell, type \textbf{bash} at the command prompt:
\begin{verbatim}
[speed-submit] [/home/a/a_user] > bash
bash-4.4$ echo $0
bash
\end{verbatim}
\noindent
\textbf{Note} how the command prompt changes from
``\verb![speed-submit] [/home/a/a_user] >!'' to ``\verb!bash-4.4$!'' after entering the bash shell.
% B.2.3 How do I use the bash shell in an interactive session on Speed?
% -----------------------------------------------------------------------------
\subsubsection{How do I use the bash shell in an interactive session on Speed?}
Below are examples of how to use \tool{bash} as a shell in your interactive job sessions
with both the \tool{salloc} and \tool{srun} commands.
\begin{itemize}
\item \texttt{salloc -ppt --mem=100G -N 1 -n 10 /encs/bin/bash}
\item \texttt{srun --mem=50G -n 5 --pty /encs/bin/bash}
\end{itemize}
\noindent\textbf{Note:} Make sure the interactive job explicitly requests the resources it needs (memory, cores, etc.), as in the examples above.
% B.2.4 How do I run scripts written in bash on Speed?
% -----------------------------------------------------------------------------
\subsubsection{How do I run scripts written in bash on \tool{Speed}?}
To execute bash scripts on Speed:
\begin{enumerate}
\item Ensure that the shebang of your bash job script is \verb+#!/encs/bin/bash+.
\item Use the \tool{sbatch} command to submit your job script to the scheduler.
\end{enumerate}
\noindent See the Speed GitHub repository for a
\href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{sample bash job script}.
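
\noindent As a minimal sketch of such a job script (the job name and resource values are illustrative assumptions, not recommended settings; the sample script in the repository remains the reference):
\begin{verbatim}
#!/encs/bin/bash
# Illustrative job name and resource requests (assumptions; adjust to your job).
#SBATCH --job-name=bash-demo
#SBATCH --mem=1G
#SBATCH --ntasks=1

# Job body: report the node and the bash version the job runs under.
msg="Running on $(uname -n) with bash ${BASH_VERSION}"
echo "$msg"
\end{verbatim}
\noindent Saved under a hypothetical name such as \texttt{demo.sh}, this script would be submitted with \texttt{sbatch demo.sh}.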
% B.3 How to resolve “Disk quota exceeded” errors?
% -------------------------------------------------------------
\subsection{How to resolve ``Disk quota exceeded'' errors?}
\label{sect:quota-exceeded}
% B.3.1 Probable Cause
% -----------------------------------------------------------------------------
\subsubsection{Probable Cause}
The ``\texttt{Disk quota exceeded}'' error occurs when your application has
run out of disk space to write to. On \tool{Speed}, this error can be returned when:
\begin{enumerate}
\item The NFS-provided home is full and cannot be written to.
You can verify this using the \tool{quota} and \tool{bigfiles} commands.
\item The ``\texttt{/tmp}'' directory on the speed node where your application is running is full and cannot be written to.
\end{enumerate}
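
\noindent To see which of the two cases applies, the free space in both locations can be inspected. The following is a minimal sketch using the standard \tool{df} utility; the \tool{quota} command is only invoked if it is installed, and the variable name \verb!tmp_usage! is introduced here for illustration:
\begin{verbatim}
# Free space in /tmp on the current node (POSIX output format).
tmp_usage=$(df -P /tmp | tail -n 1)
echo "/tmp: $tmp_usage"

# Home-directory quota, shown only where the quota command exists.
if command -v quota >/dev/null 2>&1; then
    quota || true
fi
\end{verbatim}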
% B.3.2 Possible Solutions
% -----------------------------------------------------------------------------
\subsubsection{Possible Solutions}
\begin{enumerate}
\item Use the \option{--chdir} job script option to set the job working directory.
This is the directory where the job will write output files.
\item Although local disk space is recommended for IO-intensive operations, the
`\texttt{/tmp}' directory on \tool{Speed} nodes is limited to 1TB, so it may be necessary
to store temporary data elsewhere. Review the documentation for each module
used in your script to determine how to set working directories.
The basic steps are:
\begin{itemize}
\item
Determine how to set working directories for each module used in your job script.
\item
Create a working directory in \tool{speed-scratch} for output files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/output
\end{verbatim}
\item
Create a directory in \tool{speed-scratch} for recovery files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/recovery
\end{verbatim}
\item
Update the job script to write output to the directories created in your
\tool{speed-scratch} directory, e.g., \verb!/speed-scratch/$USER/output!.
\end{itemize}
\end{enumerate}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.
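
\noindent As a sketch of the first solution, the working directory can be set in the job script itself. Here \verb!a_user! is a placeholder username and the memory request is an illustrative assumption; the path is written out literally on the assumption that environment variables such as \verb!$USER! are not expanded inside \verb!#SBATCH! lines:
\begin{verbatim}
#!/encs/bin/bash
# Placeholder path: replace a_user with your ENCS username.
#SBATCH --chdir=/speed-scratch/a_user/output
# Illustrative memory request.
#SBATCH --mem=1G

# The job starts in the --chdir directory, so relative paths land there.
workdir=$(pwd)
echo "Writing output files under: $workdir"
\end{verbatim}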
% B.3.3 Example of setting working directories for COMSOL
% -----------------------------------------------------------------------------
\subsubsection{Example of setting working directories for \tool{COMSOL}}
\begin{itemize}
\item Create directories for recovery, temporary, and configuration files.
\begin{verbatim}
mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
\end{verbatim}
\item Add the following command switches to the COMSOL command to use the
directories created above:
\begin{verbatim}
-recoverydir /speed-scratch/$USER/comsol/recovery
-tmpdir /speed-scratch/$USER/comsol/tmp
-configuration /speed-scratch/$USER/comsol/config
\end{verbatim}
\end{itemize}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.
% B.3.4 Example of setting working directories for Python Modules
% -----------------------------------------------------------------------------
\subsubsection{Example of setting working directories for \tool{Python Modules}}
By default, when adding a Python module, the \texttt{/tmp} directory is used as the temporary location for file downloads.
The size of the \texttt{/tmp} directory on \verb!speed-submit! is too small for PyTorch.
To add a Python module:
\begin{itemize}
\item Create your own tmp directory in your \verb!speed-scratch! directory:
\begin{verbatim}
mkdir /speed-scratch/$USER/tmp
\end{verbatim}
\item Set \verb!TMPDIR! to the temporary directory you created:
\begin{verbatim}
setenv TMPDIR /speed-scratch/$USER/tmp
\end{verbatim}
\item Attempt the installation of PyTorch.
\end{itemize}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.
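
\noindent The \texttt{setenv} command above is csh/tcsh syntax. For users working in bash, a minimal equivalent sketch follows. The variable name \verb!SCRATCH_TMP! is introduced here for illustration only, and the \texttt{else} branch simply leaves \verb!TMPDIR! untouched when the directory cannot be created:
\begin{verbatim}
# Hypothetical helper variable: per-user temporary directory on speed-scratch.
SCRATCH_TMP="/speed-scratch/$USER/tmp"

# Create the directory if possible and point TMPDIR at it (bash syntax).
if mkdir -p "$SCRATCH_TMP" 2>/dev/null; then
    export TMPDIR="$SCRATCH_TMP"
else
    echo "Cannot create $SCRATCH_TMP; TMPDIR left unchanged" >&2
fi
echo "TMPDIR=${TMPDIR:-/tmp}"
\end{verbatim}
\noindent With \verb!TMPDIR! set in this way, the PyTorch installation can then be attempted from the same shell.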
% B.4 How do I check my job's status?
% -----------------------------------------------------------------------------
\subsection{How do I check my job's status?}
%When a job with a job id of 1234 is running, the status of that job can be tracked using \verb!`qstat -j 1234`!.
%Likewise, if the job is pending, the \verb!`qstat -j 1234`! command will report as to why the job is not scheduled or running.
%Once the job has finished, or has been killed, the \textbf{qacct} command must be used to query the job's status, e.g., \verb!`qaact -j [jobid]`!.
When a job with a job ID of 1234 is running or terminated,
you can track its status using the following commands:
\begin{itemize}
\item Use the ``sacct'' command to view the status of a job:
\begin{verbatim}
sacct -j 1234
\end{verbatim}
\item Use the ``squeue'' command to see if the job is sitting in the queue:
\begin{verbatim}
squeue -j 1234
\end{verbatim}
\item Use the ``sstat'' command to view resource-usage statistics of the job's steps while the job is still running
(once the job has terminated and the \tool{slurmctld} has purged it from its tracking state into the database, use ``sacct'' instead):
\begin{verbatim}
sstat -j 1234
\end{verbatim}
\end{itemize}
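
\noindent The default \tool{sacct} output can be narrowed to the fields of interest with its \option{--format} option. A minimal sketch, in which the chosen field list is an illustrative assumption and the guard only makes the snippet harmless on machines without Slurm:
\begin{verbatim}
# Example job ID used throughout this section.
job_id=1234
# Illustrative selection of sacct fields.
fields="JobID,JobName,State,Elapsed,ExitCode"

if command -v sacct >/dev/null 2>&1; then
    sacct -j "$job_id" --format="$fields"
else
    echo "sacct not found: run this on a Slurm login node such as speed-submit"
fi
\end{verbatim}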
% B.5 Why is my job pending when nodes are empty?
% -----------------------------------------------------------------------------
\subsection{Why is my job pending when nodes are empty?}
% B.5.1 Disabled nodes
% -----------------------------------------------------------------------------
\subsubsection{Disabled nodes}
It is possible that one or more of the Speed nodes are disabled for maintenance.
To verify if Speed nodes are disabled, check if they are in a draining or drained state:
\small
\begin{verbatim}
[serguei@speed-submit src] % sinfo --long --Node
Thu Oct 19 21:25:12 2023
NODELIST  NODES  PARTITION  STATE    CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
speed-01  1      pa         idle     32    2:16:1  257458  0         1       gpu16     none
speed-03  1      pa         idle     32    2:16:1  257458  0         1       gpu32     none
speed-05  1      pg         idle     32    2:16:1  515490  0         1       gpu16     none
speed-07  1      ps*        mixed    32    2:16:1  515490  0         1       cpu32     none
speed-08  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-09  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-10  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-11  1      ps*        idle     32    2:16:1  515490  0         1       cpu32     none
speed-12  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-15  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-16  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-17  1      pg         drained  32    2:16:1  515490  0         1       gpu16     UGE
speed-19  1      ps*        idle     32    2:16:1  515490  0         1       cpu32     none
speed-20  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-21  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-22  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-23  1      ps*        idle     32    2:16:1  515490  0         1       cpu32     none
speed-24  1      ps*        idle     32    2:16:1  515490  0         1       cpu32     none
speed-25  1      pg         idle     32    2:16:1  257458  0         1       gpu32     none
speed-25  1      pa         idle     32    2:16:1  257458  0         1       gpu32     none
speed-27  1      pg         idle     32    2:16:1  257458  0         1       gpu32     none
speed-27  1      pa         idle     32    2:16:1  257458  0         1       gpu32     none
speed-29  1      ps*        idle     32    2:16:1  515490  0         1       cpu32     none
speed-30  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-31  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-32  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-33  1      ps*        idle     32    2:16:1  515490  0         1       cpu32     none
speed-34  1      ps*        idle     32    2:16:1  515490  0         1       cpu32     none
speed-35  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-36  1      ps*        drained  32    2:16:1  515490  0         1       cpu32     UGE
speed-37  1      pt         idle     256   2:64:2  980275  0         1       gpu20,mi  none
speed-38  1      pt         idle     256   2:64:2  980275  0         1       gpu20,mi  none
speed-39  1      pt         idle     256   2:64:2  980275  0         1       gpu20,mi  none
speed-40  1      pt         idle     256   2:64:2  980275  0         1       gpu20,mi  none
speed-41  1      pt         idle     256   2:64:2  980275  0         1       gpu20,mi  none
speed-42  1      pt         idle     256   2:64:2  980275  0         1       gpu20,mi  none
speed-43  1      pt         idle     256   2:64:2  980275  0         1       gpu20,mi  none
\end{verbatim}
\normalsize
\noindent Note which nodes are in the \textbf{drained} state.
The reason for the drained state is given in the \textbf{REASON} column.\\
\noindent Your job will run once an occupied node becomes available, or once the maintenance is completed
and the disabled nodes return to the \textbf{idle} state.
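
\noindent To pick out only the disabled nodes, the listing can be filtered by its STATE column. The sketch below is an illustration rather than a site-provided tool: it runs \tool{awk} over a few sample lines copied from the output above, and the variable names \verb!sample! and \verb!drained! are hypothetical:
\begin{verbatim}
# Sample lines from the sinfo listing (fields: NODELIST NODES PARTITION STATE ...).
sample='speed-07 1 ps* mixed 32 2:16:1 515490 0 1 cpu32 none
speed-08 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-09 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-11 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none'

# Print the node name and the last field (REASON) for every node whose
# STATE starts with "drain", i.e. draining or drained.
drained=$(printf '%s\n' "$sample" | awk '$4 ~ /^drain/ {print $1, $NF}')
echo "$drained"
\end{verbatim}
\noindent On the cluster, the same \tool{awk} filter can be applied to the live output of \verb!sinfo --long --Node!; \verb!sinfo -R! also lists the reasons for unavailable nodes directly.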
% B.5.2 Error in job submit request.
% -----------------------------------------------------------------------------
\subsubsection{Error in job submit request}
It is possible that your job is pending because it requested resources that are not available within Speed.
To verify why job ID 1234 is not running, execute:
\begin{verbatim}
sacct -j 1234
\end{verbatim}
\noindent
A summary of the reasons can be obtained via the \tool{squeue} command.
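
\noindent As a sketch of how to obtain that summary, \tool{squeue} can print the pending reason next to the job state through its \option{-o} format option. The format string below is an illustrative choice (\verb!%i! job ID, \verb!%P! partition, \verb!%T! state, \verb!%r! reason), and the guard only makes the snippet harmless on machines without Slurm:
\begin{verbatim}
# Example job ID from the text above.
job_id=1234
# Illustrative format: job ID, partition, state, pending reason.
fmt='%i %P %T %r'

if command -v squeue >/dev/null 2>&1; then
    squeue -j "$job_id" -o "$fmt"
else
    echo "squeue not found: run this on a Slurm login node such as speed-submit"
fi
\end{verbatim}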