<!DOCTYPE html>
<html lang='en-US' xml:lang='en-US'>
<head> <title>Speed: The GCS ENCS Cluster</title>
<meta charset='utf-8' />
<meta content='TeX4ht (https://tug.org/tex4ht/)' name='generator' />
<meta content='width=device-width,initial-scale=1' name='viewport' />
<link href='speed-manual.css' rel='stylesheet' type='text/css' />
<meta content='speed-manual.tex' name='src' />
<script>window.MathJax = { tex: { tags: "ams", }, }; </script>
<script async='async' id='MathJax-script' src='https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-chtml-full.js' type='text/javascript'></script>
<link href='https://latex.now.sh/style.css' rel='stylesheet' />
</head><body>
<div class='maketitle'>
<h2 class='titleHead'>Speed: The GCS ENCS Cluster</h2>
<div class='author'> Serguei A. Mokhov <br class='and' />Gillian A. Roper <br class='and' />Carlos Alarcón Meza <br class='and' />Farah Salhany <br class='and' />Network, Security and HPC Group<span class='thank-mark'><a href='#tk-1'><span class='tcrm-1000'>∗</span></a></span>
<br /> <span class='cmr-9'>Gina Cody School of Engineering and Computer Science</span>
<br /> <span class='cmr-9'>Concordia University</span>
<br /> <span class='cmr-9'>Montreal, Quebec, Canada</span>
<br /> <a class='url' href='rt-ex-hpc~AT~encs.concordia.ca'><span class='cmtt-9'>rt-ex-hpc~AT~encs.concordia.ca</span></a><br /></div><br />
<div class='date'><span class='cmbx-10'>Version 7.2</span></div>
<div class='thanks'><br /><a id='tk-1'></a><span class='thank-mark'><span class='tcrm-1000'>∗</span></span>The group acknowledges the initial manual, version VI, produced by Dr. Scott Bunnell while with
us, as well as Dr. Tariq Daradkeh for his instructional support of the users and contribution of
examples.</div></div>
<section class='abstract' role='doc-abstract'>
<h3 class='abstracttitle'>
<span class='cmbx-9'>Abstract</span>
</h3>
<!-- l. 90 --><p class='noindent'><span class='cmr-9'>This document serves as a quick-start guide to using the Gina Cody School of Engineering
and Computer Science (GCS ENCS) compute server farm, known as “Speed,” which is managed by
the HPC/NAG group of the Academic Information Technology Services (AITS) at GCS,
Concordia University, Montreal, Canada.</span>
</p>
</section>
<h3 class='likesectionHead' id='contents'><a id='x1-1000'></a>Contents</h3>
<div class='tableofcontents'>
<span class='sectionToc'>1 <a href='#introduction' id='QQ2-1-2'>Introduction</a></span>
<br /> <span class='subsectionToc'>1.1 <a href='#citing-us' id='QQ2-1-3'>Citing Us</a></span>
<br /> <span class='subsectionToc'>1.2 <a href='#resources' id='QQ2-1-4'>Resources</a></span>
<br /> <span class='subsectionToc'>1.3 <a href='#team' id='QQ2-1-5'>Team</a></span>
<br /> <span class='subsectionToc'>1.4 <a href='#what-speed-consists-of' id='QQ2-1-6'>What Speed Consists of</a></span>
<br /> <span class='subsectionToc'>1.5 <a href='#what-speed-is-ideal-for' id='QQ2-1-10'>What Speed Is Ideal For</a></span>
<br /> <span class='subsectionToc'>1.6 <a href='#what-speed-is-not' id='QQ2-1-11'>What Speed Is Not</a></span>
<br /> <span class='subsectionToc'>1.7 <a href='#available-software' id='QQ2-1-12'>Available Software</a></span>
<br /> <span class='subsectionToc'>1.8 <a href='#requesting-access' id='QQ2-1-13'>Requesting Access</a></span>
<br /> <span class='sectionToc'>2 <a href='#job-management' id='QQ2-1-14'>Job Management</a></span>
<br /> <span class='subsectionToc'>2.1 <a href='#getting-started' id='QQ2-1-15'>Getting Started</a></span>
<br /> <span class='subsubsectionToc'>2.1.1 <a href='#ssh-connections' id='QQ2-1-16'>SSH Connections</a></span>
<br /> <span class='subsubsectionToc'>2.1.2 <a href='#environment-set-up' id='QQ2-1-17'>Environment Set Up</a></span>
<br /> <span class='subsectionToc'>2.2 <a href='#job-submission-basics' id='QQ2-1-18'>Job Submission Basics</a></span>
<br /> <span class='subsubsectionToc'>2.2.1 <a href='#directives' id='QQ2-1-19'>Directives</a></span>
<br /> <span class='subsubsectionToc'>2.2.2 <a href='#working-with-modules' id='QQ2-1-20'>Working with Modules</a></span>
<br /> <span class='subsubsectionToc'>2.2.3 <a href='#user-scripting' id='QQ2-1-21'>User Scripting</a></span>
<br /> <span class='subsectionToc'>2.3 <a href='#sample-job-script' id='QQ2-1-22'>Sample Job Script</a></span>
<br /> <span class='subsectionToc'>2.4 <a href='#common-job-management-commands-summary' id='QQ2-1-25'>Common Job Management Commands Summary</a></span>
<br /> <span class='subsectionToc'>2.5 <a href='#advanced-sbatch-options' id='QQ2-1-26'>Advanced <span class='cmtt-10'>sbatch </span>Options</a></span>
<br /> <span class='subsectionToc'>2.6 <a href='#array-jobs' id='QQ2-1-27'>Array Jobs</a></span>
<br /> <span class='subsectionToc'>2.7 <a href='#requesting-multiple-cores-ie-multithreading-jobs' id='QQ2-1-28'>Requesting Multiple Cores (i.e., Multithreading Jobs)</a></span>
<br /> <span class='subsectionToc'>2.8 <a href='#interactive-jobs' id='QQ2-1-29'>Interactive Jobs</a></span>
<br /> <span class='subsubsectionToc'>2.8.1 <a href='#command-line' id='QQ2-1-30'>Command Line</a></span>
<br /> <span class='subsubsectionToc'>2.8.2 <a href='#graphical-applications' id='QQ2-1-31'>Graphical Applications</a></span>
<br /> <span class='subsubsectionToc'>2.8.3 <a href='#jupyter-notebooks' id='QQ2-1-33'>Jupyter Notebooks</a></span>
<br /> <span class='paragraphToc'>2.8.3.1 <a href='#jupyter-notebook-in-singularity' id='QQ2-1-34'>Jupyter Notebook in Singularity</a></span>
<br /> <span class='paragraphToc'>2.8.3.2 <a href='#jupyterlab-in-conda-and-pytorch' id='QQ2-1-38'>JupyterLab in Conda and Pytorch</a></span>
<br /> <span class='paragraphToc'>2.8.3.3 <a href='#jupyterlab-pytorch-in-python-venv' id='QQ2-1-43'>JupyterLab + Pytorch in Python venv</a></span>
<br /> <span class='subsubsectionToc'>2.8.4 <a href='#visual-studio-code' id='QQ2-1-44'>Visual Studio Code</a></span>
<br /> <span class='subsectionToc'>2.9 <a href='#scheduler-environment-variables' id='QQ2-1-46'>Scheduler Environment Variables</a></span>
<br /> <span class='subsectionToc'>2.10 <a href='#ssh-keys-for-mpi' id='QQ2-1-49'>SSH Keys for MPI</a></span>
<br /> <span class='subsectionToc'>2.11 <a href='#creating-virtual-environments' id='QQ2-1-50'>Creating Virtual Environments</a></span>
<br /> <span class='subsubsectionToc'>2.11.1 <a href='#anaconda' id='QQ2-1-51'>Anaconda</a></span>
<br /> <span class='paragraphToc'>2.11.1.1 <a href='#conda-env-without-prefix' id='QQ2-1-52'>Conda Env without <span class='cmtt-10'>--prefix</span></a></span>
<br /> <span class='subsubsectionToc'>2.11.2 <a href='#python' id='QQ2-1-53'>Python</a></span>
<br /> <span class='subsectionToc'>2.12 <a href='#example-job-script-fluent' id='QQ2-1-54'>Example Job Script: Fluent</a></span>
<br /> <span class='subsectionToc'>2.13 <a href='#example-job-efficientdet' id='QQ2-1-57'>Example Job: EfficientDet</a></span>
<br /> <span class='subsectionToc'>2.14 <a href='#java-jobs' id='QQ2-1-58'>Java Jobs</a></span>
<br /> <span class='subsectionToc'>2.15 <a href='#scheduling-on-the-gpu-nodes' id='QQ2-1-59'>Scheduling on the GPU Nodes</a></span>
<br /> <span class='subsubsectionToc'>2.15.1 <a href='#p-on-multigpu-multinode' id='QQ2-1-60'>P6 on Multi-GPU, Multi-Node</a></span>
<br /> <span class='subsubsectionToc'>2.15.2 <a href='#cuda' id='QQ2-1-61'>CUDA</a></span>
<br /> <span class='subsubsectionToc'>2.15.3 <a href='#special-notes-for-sending-cuda-jobs-to-the-gpu-queues' id='QQ2-1-62'>Special Notes for Sending CUDA Jobs to the GPU Queues</a></span>
<br /> <span class='subsubsectionToc'>2.15.4 <a href='#openiss-examples' id='QQ2-1-63'>OpenISS Examples</a></span>
<br /> <span class='paragraphToc'>2.15.4.1 <a href='#openiss-and-reid' id='QQ2-1-64'>OpenISS and REID</a></span>
<br /> <span class='paragraphToc'>2.15.4.2 <a href='#openiss-and-yolov' id='QQ2-1-65'>OpenISS and YOLOv3</a></span>
<br /> <span class='subsectionToc'>2.16 <a href='#singularity-containers' id='QQ2-1-66'>Singularity Containers</a></span>
<br /> <span class='sectionToc'>3 <a href='#conclusion' id='QQ2-1-67'>Conclusion</a></span>
<br /> <span class='subsectionToc'>3.1 <a href='#important-limitations' id='QQ2-1-68'>Important Limitations</a></span>
<br /> <span class='subsectionToc'>3.2 <a href='#tipstricks' id='QQ2-1-69'>Tips/Tricks</a></span>
<br /> <span class='subsectionToc'>3.3 <a href='#use-cases' id='QQ2-1-70'>Use Cases</a></span>
<br /> <span class='sectionToc'>A <a href='#history' id='QQ2-1-71'>History</a></span>
<br /> <span class='subsectionToc'>A.1 <a href='#acknowledgments' id='QQ2-1-72'>Acknowledgments</a></span>
<br /> <span class='subsectionToc'>A.2 <a href='#migration-from-uge-to-slurm' id='QQ2-1-73'>Migration from UGE to SLURM</a></span>
<br /> <span class='subsectionToc'>A.3 <a href='#phases' id='QQ2-1-75'>Phases</a></span>
<br /> <span class='subsubsectionToc'>A.3.1 <a href='#phase-' id='QQ2-1-76'>Phase 5</a></span>
<br /> <span class='subsubsectionToc'>A.3.2 <a href='#phase-1' id='QQ2-1-77'>Phase 4</a></span>
<br /> <span class='subsubsectionToc'>A.3.3 <a href='#phase-2' id='QQ2-1-78'>Phase 3</a></span>
<br /> <span class='subsubsectionToc'>A.3.4 <a href='#phase-3' id='QQ2-1-79'>Phase 2</a></span>
<br /> <span class='subsubsectionToc'>A.3.5 <a href='#phase-4' id='QQ2-1-80'>Phase 1</a></span>
<br /> <span class='sectionToc'>B <a href='#frequently-asked-questions' id='QQ2-1-81'>Frequently Asked Questions</a></span>
<br /> <span class='subsectionToc'>B.1 <a href='#where-do-i-learn-about-linux' id='QQ2-1-82'>Where do I learn about Linux?</a></span>
<br /> <span class='subsectionToc'>B.2 <a href='#how-to-use-bash-shell-on-speed' id='QQ2-1-85'>How to use bash shell on Speed?</a></span>
<br /> <span class='subsubsectionToc'>B.2.1 <a href='#how-do-i-set-bash-as-my-login-shell' id='QQ2-1-86'>How do I set bash as my login shell?</a></span>
<br /> <span class='subsubsectionToc'>B.2.2 <a href='#how-do-i-move-into-a-bash-shell-on-speed' id='QQ2-1-87'>How do I move into a bash shell on Speed?</a></span>
<br /> <span class='subsubsectionToc'>B.2.3 <a href='#how-do-i-use-the-bash-shell-in-an-interactive-session-on-speed' id='QQ2-1-88'>How do I use the bash shell in an interactive session on Speed?</a></span>
<br /> <span class='subsubsectionToc'>B.2.4 <a href='#how-do-i-run-scripts-written-in-bash-on-speed' id='QQ2-1-89'>How do I run scripts written in bash on <span class='cmtt-10'>Speed</span>?</a></span>
<br /> <span class='subsectionToc'>B.3 <a href='#how-to-resolve-disk-quota-exceeded-errors' id='QQ2-1-90'>How to resolve “Disk quota exceeded” errors?</a></span>
<br /> <span class='subsubsectionToc'>B.3.1 <a href='#probable-cause' id='QQ2-1-91'>Probable Cause</a></span>
<br /> <span class='subsubsectionToc'>B.3.2 <a href='#possible-solutions' id='QQ2-1-92'>Possible Solutions</a></span>
<br /> <span class='subsubsectionToc'>B.3.3 <a href='#example-of-setting-working-directories-for-comsol' id='QQ2-1-93'>Example of setting working directories for <span class='cmtt-10'>COMSOL</span></a></span>
<br /> <span class='subsubsectionToc'>B.3.4 <a href='#example-of-setting-working-directories-for-python-modules' id='QQ2-1-94'>Example of setting working directories for <span class='cmtt-10'>Python Modules</span></a></span>
<br /> <span class='subsectionToc'>B.4 <a href='#how-do-i-check-my-jobs-status' id='QQ2-1-95'>How do I check my job’s status?</a></span>
<br /> <span class='subsectionToc'>B.5 <a href='#why-is-my-job-pending-when-nodes-are-empty' id='QQ2-1-96'>Why is my job pending when nodes are empty?</a></span>
<br /> <span class='subsubsectionToc'>B.5.1 <a href='#disabled-nodes' id='QQ2-1-97'>Disabled nodes</a></span>
<br /> <span class='subsubsectionToc'>B.5.2 <a href='#error-in-job-submit-request' id='QQ2-1-98'>Error in job submit request.</a></span>
<br /> <span class='sectionToc'>C <a href='#sister-facilities' id='QQ2-1-99'>Sister Facilities</a></span>
<br /> <span class='sectionToc'>D <a href='#software-installed-on-speed' id='QQ2-1-100'>Software Installed On Speed</a></span>
<br /> <span class='subsectionToc'>D.1 <a href='#el' id='QQ2-1-101'>EL7</a></span>
<br /> <span class='subsectionToc'>D.2 <a href='#el1' id='QQ2-1-102'>EL9</a></span>
<br /> <span class='sectionToc'><a href='#annotated-bibliography'>Annotated Bibliography</a></span>
</div>
<h3 class='sectionHead' id='introduction'><span class='titlemark'>1 </span> <a id='x1-20001'></a>Introduction</h3>
<!-- l. 105 --><p class='noindent'>This document contains basic information required to use “Speed”, along with tips, tricks, examples,
and references to projects and papers that have used Speed. User contributions of sample jobs and/or
references are welcome.<br class='newline' />
</p><!-- l. 109 --><p class='noindent'><span class='cmbx-10'>Note: </span>On October 20, 2023, we completed the migration to SLURM from Grid Engine (UGE/AGE)
as our job scheduler. This manual has been updated to use SLURM’s syntax and commands. If you
are a long-time GE user, refer to Appendix <a href='#migration-from-uge-to-slurm'>A.2<!-- tex4ht:ref: appdx:uge-to-slurm --></a> for key highlights needed to translate your GE jobs to
SLURM as well as environment changes. These changes are also elaborated throughout this document
and our examples.
</p><!-- l. 118 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='citing-us'><span class='titlemark'>1.1 </span> <a id='x1-30001.1'></a>Citing Us</h4>
<!-- l. 121 --><p class='noindent'>If you wish to cite this work in your acknowledgements, you can use our general DOI found on our
GitHub page <a class='url' href='https://dx.doi.org/10.5281/zenodo.5683642'><span class='cmtt-10'>https://dx.doi.org/10.5281/zenodo.5683642</span></a> or a specific version of the manual and
scripts from that link individually. You can also use the “cite this repository” feature of
GitHub.
</p><!-- l. 127 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='resources'><span class='titlemark'>1.2 </span> <a id='x1-40001.2'></a>Resources</h4>
<ul class='itemize1'>
<li class='itemize'>
<!-- l. 132 --><p class='noindent'>Public GitHub page where the manual and sample job scripts are maintained at<br class='newline' /><a class='url' href='https://github.com/NAG-DevOps/speed-hpc'><span class='cmtt-10'>https://github.com/NAG-DevOps/speed-hpc</span></a> </p>
<ul class='itemize2'>
<li class='itemize'>Pull requests (PRs) are subject to review and are welcome:<br class='newline' /><a class='url' href='https://github.com/NAG-DevOps/speed-hpc/pulls'><span class='cmtt-10'>https://github.com/NAG-DevOps/speed-hpc/pulls</span></a></li></ul>
</li>
<li class='itemize'>
<!-- l. 140 --><p class='noindent'>Speed Manual: </p>
<ul class='itemize2'>
<li class='itemize'>PDF version of the manual:<br class='newline' /><a class='url' href='https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/speed-manual.pdf'><span class='cmtt-10'>https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/speed-manual.pdf</span></a>
</li>
<li class='itemize'>HTML version of the manual:<br class='newline' /><a class='url' href='https://nag-devops.github.io/speed-hpc/'><span class='cmtt-10'>https://nag-devops.github.io/speed-hpc/</span></a></li></ul>
</li>
<li class='itemize'>Concordia's official page for the “Speed” cluster, which includes access request instructions:
<a class='url' href='https://www.concordia.ca/ginacody/aits/speed.html'><span class='cmtt-10'>https://www.concordia.ca/ginacody/aits/speed.html</span></a>
</li>
<li class='itemize'>All Speed users are subscribed to the <span class='cmtt-10'>hpc-ml </span>mailing list.
</li></ul>
<!-- l. 168 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='team'><span class='titlemark'>1.3 </span> <a id='x1-50001.3'></a>Team</h4>
<!-- l. 171 --><p class='noindent'>Speed is supported by: </p>
<ul class='itemize1'>
<li class='itemize'>Serguei Mokhov, PhD, Manager, Networks, Security and HPC, AITS
</li>
<li class='itemize'>Gillian Roper, Senior Systems Administrator, HPC, AITS
</li>
<li class='itemize'>Carlos Alarcón Meza, Systems Administrator, HPC and Networking, AITS
</li>
<li class='itemize'>Farah Salhany, IT Instructional Specialist, AITS</li></ul>
<!-- l. 183 --><p class='noindent'>We receive support from the other AITS teams, such as NAG, SAG, FIS, and DOG.<br class='newline' /><a class='url' href='https://www.concordia.ca/ginacody/aits.html'><span class='cmtt-10'>https://www.concordia.ca/ginacody/aits.html</span></a>
</p><!-- l. 189 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='what-speed-consists-of'><span class='titlemark'>1.4 </span> <a id='x1-60001.4'></a>What Speed Consists of</h4>
<ul class='itemize1'>
<li class='itemize'>Twenty-four (24) 32-core compute nodes, each with 512 GB of memory and approximately
1 TB of local volatile-scratch disk space (pictured in Figure <a href='#-speed'>1<!-- tex4ht:ref: fig:speed-pics --></a>).
</li>
<li class='itemize'>Twelve (12) NVIDIA Tesla P6 GPUs, with 16 GB of GPU memory (compatible with the
CUDA, OpenGL, OpenCL, and Vulkan APIs).
</li>
<li class='itemize'>4 VIDPRO nodes (ECE, Dr. Amer), with 6 P6 cards, 6 V100 cards (32 GB), and
256 GB of RAM.
</li>
<li class='itemize'>7 new SPEED2 servers, each with 256 CPU cores and 4x A100 80 GB GPUs partitioned
into 4x 20 GB MIGs; larger local storage for TMPDIR (see Figure <a href='#-speed-cluster-hardware-architecture'>2<!-- tex4ht:ref: fig:speed-architecture-full --></a>).
</li>
<li class='itemize'>One AMD FirePro S7150 GPU, with 8 GB of memory (compatible with the DirectX,
OpenGL, OpenCL, and Vulkan APIs).
</li>
<li class='itemize'>Salus compute node (CSSE CLAC, Drs. Bergler and Kosseim), with 56 cores and 728 GB of
RAM; see Figure <a href='#-speed-cluster-hardware-architecture'>2<!-- tex4ht:ref: fig:speed-architecture-full --></a>.
</li>
<li class='itemize'>Magic subcluster partition (ECE, Dr. Khendek, 11 nodes, see Figure <a href='#-speed-cluster-hardware-architecture'>2<!-- tex4ht:ref: fig:speed-architecture-full --></a>).
</li>
<li class='itemize'>Nebular subcluster partition (CIISE, Drs. Yan, Assi, Ghafouri, et al., Nebulae GPU
node with 2x RTX 6000 Ada 48GB cards, Stellar compute node, and Matrix 177TB
storage/compute node, see Figure <a href='#-speed-cluster-hardware-architecture'>2<!-- tex4ht:ref: fig:speed-architecture-full --></a>).</li></ul>
<figure class='figure' id='-speed'>
<a id='x1-60011'></a>
<!-- l. 226 --><p class='noindent'><img alt='PIC' height='412' src='images/speed-pics.png' width='412' />
</p>
<figcaption class='caption'><span class='id'>Figure 1: </span><span class='content'>Speed</span></figcaption><!-- tex4ht:label?: x1-60011 -->
</figure>
<figure class='figure' id='-speed-cluster-hardware-architecture'>
<a id='x1-60022'></a>
<!-- l. 233 --><p class='noindent'><img alt='PIC' height='412' src='images/speed-architecture-full.png' width='412' />
</p>
<figcaption class='caption'><span class='id'>Figure 2: </span><span class='content'>Speed Cluster Hardware Architecture</span></figcaption><!-- tex4ht:label?: x1-60022 -->
</figure>
<figure class='figure' id='-speed-slurm-architecture'>
<a id='x1-60033'></a>
<!-- l. 240 --><p class='noindent'><img alt='PIC' height='412' src='images/slurm-arch.png' width='412' />
</p>
<figcaption class='caption'><span class='id'>Figure 3: </span><span class='content'>Speed SLURM Architecture</span></figcaption><!-- tex4ht:label?: x1-60033 -->
</figure>
<h4 class='subsectionHead' id='what-speed-is-ideal-for'><span class='titlemark'>1.5 </span> <a id='x1-70001.5'></a>What Speed Is Ideal For</h4>
<ul class='itemize1'>
<li class='itemize'>Design, develop, test, and run parallel, batch, and other algorithms and scripts with
partial data sets. “Speed” has been optimized for compute jobs that are multi-core aware,
require a large memory space, or are iteration intensive.
</li>
<li class='itemize'>
<!-- l. 257 --><p class='noindent'>Prepare jobs for large clusters such as: </p>
<ul class='itemize2'>
<li class='itemize'>Digital Research Alliance of Canada (Calcul Quebec and Compute Canada)
</li>
<li class='itemize'>Cloud platforms</li></ul>
</li>
<li class='itemize'>Jobs that are too demanding for a desktop.
</li>
<li class='itemize'>Single-core batch jobs; multithreaded jobs typically up to 32 cores (i.e., a single
machine).
</li>
<li class='itemize'>Multi-node multi-core jobs (MPI).
</li>
<li class='itemize'>Anything that can fit into a 500-GB memory space and a <span class='cmbx-10'>speed scratch </span>space of
approximately 10 TB.
</li>
<li class='itemize'>CPU-based jobs.
</li>
<li class='itemize'>CUDA GPU jobs.
</li>
<li class='itemize'>Non-CUDA GPU jobs using OpenCL.</li></ul>
<!-- l. 280 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='what-speed-is-not'><span class='titlemark'>1.6 </span> <a id='x1-80001.6'></a>What Speed Is Not</h4>
<ul class='itemize1'>
<li class='itemize'>Speed is not a web host and does not host websites.
</li>
<li class='itemize'>Speed is not meant for Continuous Integration (CI) automation deployments for Ansible
or similar tools.
</li>
<li class='itemize'>Speed does not run Kubernetes or other container orchestration software.
</li>
<li class='itemize'>Speed does not run Docker. (<span class='cmbx-10'>Note: </span>Speed does run Singularity, and many Docker containers
can be converted to Singularity containers with a single command. See Section <a href='#singularity-containers'>2.16<!-- tex4ht:ref: sect:singularity-containers --></a>.)
</li>
<li class='itemize'>Speed is not for jobs executed outside of the scheduler. (Jobs running outside of the
scheduler will be killed and all data lost.)</li></ul>
<!-- l. 294 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='available-software'><span class='titlemark'>1.7 </span> <a id='x1-90001.7'></a>Available Software</h4>
<!-- l. 297 --><p class='noindent'>There is a wide range of open-source and commercial software available and installed on “Speed.”
This includes Abaqus <span class='cite'>[<a href='#Xabaqus'>1</a>]</span>, AllenNLP, Anaconda, ANSYS, Bazel, COMSOL, CPLEX, CUDA, Eclipse,
Fluent <span class='cite'>[<a href='#Xfluent'>2</a>]</span>, Gurobi, MATLAB <span class='cite'>[<a href='#Xmatlab'>15</a>, <a href='#Xscholarpedia-matlab'>30</a>]</span>, OMNeT++, OpenCV, OpenFOAM, OpenMPI, OpenPMIx,
ParaView, PyTorch, QEMU, R, Rust, and Singularity among others. Programming environments
include various versions of Python, C++/Java compilers, TensorFlow, OpenGL, OpenISS, and
MARF <span class='cite'>[<a href='#Xmarf'>31</a>]</span>.<br class='newline' />
</p><!-- l. 303 --><p class='indent'> In particular, there are over 2200 programs available in <span class='cmtt-10'>/encs/bin </span>and <span class='cmtt-10'>/encs/pkg </span>under Scientific
Linux 7 (EL7). We are building an equivalent array of programs for the EL9 SPEED2 nodes. To see
the packages available, run <span class='cmtt-10'>ls -al /encs/pkg/ </span>on <span class='cmtt-10'>speed.encs</span>. See a complete list in
Appendix <a href='#software-installed-on-speed'>D<!-- tex4ht:ref: sect:software-details --></a>.<br class='newline' />
</p><!-- l. 307 --><p class='noindent'><span class='cmbx-10'>Note: </span>We do our best to accommodate custom software requests. Python environments can use
user-custom installs from within the scratch directory.
</p><!-- l. 313 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='requesting-access'><span class='titlemark'>1.8 </span> <a id='x1-100001.8'></a>Requesting Access</h4>
<!-- l. 316 --><p class='noindent'>After reviewing the “What Speed is” (Section <a href='#what-speed-is-ideal-for'>1.5<!-- tex4ht:ref: sect:speed-is-for --></a>) and “What Speed is Not” (Section <a href='#what-speed-is-not'>1.6<!-- tex4ht:ref: sect:speed-is-not --></a>), request
access to the “Speed” cluster by emailing: <span class='cmtt-10'>rt-ex-hpc AT encs.concordia.ca</span>.
</p>
<ul class='itemize1'>
<li class='itemize'>GCS ENCS faculty and staff may request access directly.
</li>
<li class='itemize'>
<!-- l. 322 --><p class='noindent'>GCS students must include the following in their request message: </p>
<ul class='itemize2'>
<li class='itemize'>GCS ENCS username
</li>
<li class='itemize'>Name and email (CC) of the approver – either a supervisor, course instructor, or a
department representative (e.g., in the case of undergraduate or M.Eng. students it
can be the Chair, associate chair, a technical officer, or a department administrator)
for approval.
</li>
<li class='itemize'>Written request from the approver for the GCS ENCS username to be granted access
to “Speed.”</li></ul>
</li>
<li class='itemize'>Non-GCS students taking a GCS course will have their GCS ENCS account created
automatically, but still need the course instructor’s approval to use the service.
</li>
<li class='itemize'>Non-GCS faculty and students need to get a “sponsor” within GCS, so that a guest GCS ENCS
account is created first. A sponsor can be any GCS faculty member you collaborate with.
Failing that, request approval from our Dean’s Office, via our Associate Deans, Drs. Eddie
Hoi Ng or Emad Shihab.
</li>
<li class='itemize'>External entities collaborating with GCS Concordia researchers should also go through the
Dean’s Office for approvals.</li></ul>
<!-- l. 347 --><p class='noindent'>
</p>
<h3 class='sectionHead' id='job-management'><span class='titlemark'>2 </span> <a id='x1-110002'></a>Job Management</h3>
<!-- l. 350 --><p class='noindent'>We use SLURM as the workload manager. It primarily supports two types of jobs: batch and
interactive. Batch jobs run unattended tasks, whereas interactive jobs are ideal for
setting up virtual environments, compilation, and debugging.<br class='newline' />
</p><!-- l. 354 --><p class='noindent'><span class='cmbx-10'>Note: </span>In the following instructions, anything in angle brackets, like <span class='obeylines-h'><span class='verb'><span class='cmtt-10'><></span></span></span>, indicates a label/value to be replaced
(the entire bracketed term needs replacement).<br class='newline' />
</p><!-- l. 357 --><p class='noindent'>Job instructions in a script start with the <span class='obeylines-h'><span class='verb'><span class='cmtt-10'>#SBATCH</span></span></span> prefix, for example:
</p>
<pre class='verbatim' id='verbatim-1'>
#SBATCH --mem=100M -t 600 -J <job-name> -A <slurm account>
#SBATCH -p pg --gpus=2 --mail-type=ALL
</pre>
<!-- l. 361 --><p class='nopar'> For complex compute steps within a script, use <span class='cmtt-10'>srun</span>. We recommend using <span class='cmtt-10'>salloc </span>for interactive
jobs as it supports multiple steps. However, <span class='cmtt-10'>srun </span>can also be used to start interactive jobs (see
Section <a href='#interactive-jobs'>2.8<!-- tex4ht:ref: sect:interactive-jobs --></a>). Common and required job parameters include:
</p>
<div class='columns-2'>
<ul class='itemize1'>
<li class='itemize'>memory (<span class='cmtt-10'>--mem</span>),
</li>
<li class='itemize'>time (<span class='cmtt-10'>-t</span>),
</li>
<li class='itemize'><span class='cmtt-10'>--job-name </span>(<span class='cmtt-10'>-J</span>),
</li>
<li class='itemize'>slurm project account (<span class='cmtt-10'>-A</span>),
</li>
<li class='itemize'>partition (<span class='cmtt-10'>-p</span>),
</li>
<li class='itemize'>mail type (<span class='cmtt-10'>--mail-type</span>),
</li>
<li class='itemize'>ntasks (<span class='cmtt-10'>-n</span>),
</li>
<li class='itemize'>CPUs per task (<span class='cmtt-10'>--cpus-per-task</span>).</li></ul>
</div>
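<p class='noindent'>Putting the common parameters above together, a minimal batch script might look as follows. This is only a sketch: the job name, resource values, and the bracketed partition and account placeholders must be replaced with values valid for your account.</p>

```shell
#!/bin/bash
## Minimal sample batch script (a sketch; replace the bracketed
## placeholders and resource values with ones valid for your account).
#SBATCH -J my-job             ## --job-name
#SBATCH --mem=100M            ## memory
#SBATCH -t 10                 ## time limit, in minutes
#SBATCH -n 1                  ## ntasks
#SBATCH --cpus-per-task=2     ## CPUs per task
##SBATCH -p <partition>       ## partition; uncomment and fill in
##SBATCH -A <slurm account>   ## project account; uncomment and fill in
#SBATCH --mail-type=ALL       ## email notifications

## The compute step(s) go below; the #SBATCH lines above are plain
## shell comments, so the body runs like any other shell script.
msg="Job starting on $(hostname)"
echo "$msg"
```

<p class='noindent'>Submit the script with <span class='cmtt-10'>sbatch</span> followed by the script name; the scheduler reads the <span class='cmtt-10'>#SBATCH</span> directives before executing the body.</p>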
<!-- l. 391 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='getting-started'><span class='titlemark'>2.1 </span> <a id='x1-120002.1'></a>Getting Started</h4>
<!-- l. 394 --><p class='noindent'>Before getting started, please review the “What Speed is” (Section <a href='#what-speed-is-ideal-for'>1.5<!-- tex4ht:ref: sect:speed-is-for --></a>) and “What Speed is Not”
(Section <a href='#what-speed-is-not'>1.6<!-- tex4ht:ref: sect:speed-is-not --></a>). Once your GCS ENCS account has been granted access to “Speed”, use
your GCS ENCS account credentials to create an SSH connection to <span class='cmtt-10'>speed </span>(an alias for
<span class='cmtt-10'>speed-submit.encs.concordia.ca</span>).<br class='newline' />
</p><!-- l. 400 --><p class='indent'> All users are expected to have a basic understanding of Linux and its commonly used commands
(see Appendix <a href='#frequently-asked-questions'>B<!-- tex4ht:ref: sect:faqs --></a> for resources).
</p><!-- l. 405 --><p class='noindent'>
</p>
<h5 class='subsubsectionHead' id='ssh-connections'><span class='titlemark'>2.1.1 </span> <a id='x1-130002.1.1'></a>SSH Connections</h5>
<!-- l. 408 --><p class='noindent'>Requirements to create connections to “Speed”:
</p><ol class='enumerate1'>
<li class='enumerate' id='x1-13002x1'><span class='cmbx-10'>Active GCS ENCS user account: </span>Ensure you have an active GCS ENCS user account
with permission to connect to Speed (see Section <a href='#requesting-access'>1.8<!-- tex4ht:ref: sect:access-requests --></a>).
</li>
<li class='enumerate' id='x1-13004x2'><span class='cmbx-10'>VPN Connection </span>(for off-campus access): If you are off-campus, you will need to
establish an active connection to Concordia’s VPN, which requires a Concordia netname.
</li>
<li class='enumerate' id='x1-13006x3'><span class='cmbx-10'>Terminal Emulator for Windows: </span>Windows systems use a terminal emulator such as
PuTTY, Cygwin, or MobaXterm.
</li>
<li class='enumerate' id='x1-13008x4'><span class='cmbx-10'>Terminal for macOS: </span>macOS systems have a built-in Terminal app or <span class='cmtt-10'>xterm </span>that comes
with XQuartz.</li></ol>
<!-- l. 418 --><p class='noindent'>To create an SSH connection to Speed, open a terminal window and type the following command,
replacing <span class='obeylines-h'><span class='verb'><span class='cmtt-10'><ENCSusername></span></span></span> with your ENCS account’s username:
</p>
<pre class='verbatim' id='verbatim-2'>
ssh <ENCSusername>@speed.encs.concordia.ca
</pre>
<!-- l. 421 --><p class='nopar'>
</p><!-- l. 423 --><p class='noindent'>For detailed instructions on securely connecting to a GCS server, refer to the AITS FAQ: <a href='https://www.concordia.ca/ginacody/aits/support/faq/ssh-to-gcs.html'>How do I
securely connect to a GCS server?</a>
</p><!-- l. 429 --><p class='noindent'>
</p>
<h5 class='subsubsectionHead' id='environment-set-up'><span class='titlemark'>2.1.2 </span> <a id='x1-140002.1.2'></a>Environment Set Up</h5>
<!-- l. 5 --><p class='noindent'>After creating an SSH connection to Speed, you will need to make sure the <span class='cmtt-10'>srun</span>, <span class='cmtt-10'>sbatch</span>, and <span class='cmtt-10'>salloc</span>
commands are available to you. To check this, type each command at the prompt and press Enter. If
“command not found” is returned, you need to make sure your <span class='tctt-1000'>$</span><span class='cmtt-10'>PATH </span>includes <span class='cmtt-10'>/local/bin</span>. You can
check your <span class='tctt-1000'>$</span><span class='cmtt-10'>PATH </span>by typing:
</p>
<pre class='verbatim' id='verbatim-3'>
echo $PATH
</pre>
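<p class='noindent'>If <span class='cmtt-10'>/local/bin </span>is missing, the following sketch (for <span class='cmtt-10'>bash </span>users; <span class='cmtt-10'>tcsh </span>users would adjust their
<span class='cmtt-10'>path </span>variable instead) appends it only when needed:
</p>
<pre class='verbatim'>
# append /local/bin to PATH only if it is not already present (bash)
case ":$PATH:" in
  *:/local/bin:*) ;;                    # already there, nothing to do
  *) export PATH="$PATH:/local/bin" ;;  # append it
esac
echo "$PATH"
</pre>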
<!-- l. 14 --><p class='nopar'>
</p><!-- l. 45 --><p class='noindent'>The next step is to set up your cluster-specific storage, “speed-scratch”. To do so, execute the following
command from within your home directory.
</p>
<pre class='verbatim' id='verbatim-4'>
mkdir -p /speed-scratch/$USER && cd /speed-scratch/$USER
</pre>
<!-- l. 49 --><p class='nopar'>
</p><!-- l. 51 --><p class='noindent'>Next, copy a job template to your cluster-specific storage </p>
<ul class='itemize1'>
<li class='itemize'>From Windows drive G: to Speed:<br class='newline' /><span class='obeylines-h'><span class='verb'><span class='cmtt-10'>cp /winhome/<1st letter of $USER>/$USER/example.sh /speed-scratch/$USER/</span></span></span>
</li>
<li class='itemize'>From Linux drive U: to Speed:<br class='newline' /><span class='obeylines-h'><span class='verb'><span class='cmtt-10'>cp ~/example.sh /speed-scratch/$USER/</span></span></span></li></ul>
<!-- l. 59 --><p class='noindent'><span class='cmbx-10'>Tip: </span>the default shell for GCS ENCS users is <span class='cmtt-10'>tcsh</span>. If you would like to use <span class='cmtt-10'>bash</span>, please contact
<span class='cmtt-10'>rt-ex-hpc AT encs.concordia.ca</span>.<br class='newline' />
</p><!-- l. 102 --><p class='noindent'><span class='cmbx-10'>Note: </span>If you encounter a “command not found” error after logging in to Speed, your user account
may have defunct Grid Engine environment commands. See Appendix <a href='#migration-from-uge-to-slurm'>A.2<!-- tex4ht:ref: appdx:uge-to-slurm --></a> for instructions on how to
resolve this issue.
</p><!-- l. 435 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='job-submission-basics'><span class='titlemark'>2.2 </span> <a id='x1-150002.2'></a>Job Submission Basics</h4>
<!-- l. 438 --><p class='noindent'>Preparing your job for submission is fairly straightforward. Start by basing your job script on one of
the examples available in the <span class='cmtt-10'>src/ </span>directory of our <a href='https://github.com/NAG-DevOps/speed-hpc'>GitHub repository</a>. You can clone the repository to
get the examples to start with via the command line:
</p>
<pre class='verbatim' id='verbatim-5'>
git clone --depth=1 https://github.com/NAG-DevOps/speed-hpc.git
cd speed-hpc/src
</pre>
<!-- l. 446 --><p class='nopar'>
</p><!-- l. 448 --><p class='noindent'>The job script is a shell script that contains directives, module loads, and user scripting. To quickly
run some sample jobs, use the following commands:
</p>
<pre class='verbatim' id='verbatim-6'>
sbatch -p ps -t 10 env.sh
sbatch -p ps -t 10 bash.sh
sbatch -p ps -t 10 manual.sh
sbatch -p pg -t 10 lambdal-singularity.sh
</pre>
<!-- l. 455 --><p class='nopar'>
</p><!-- l. 460 --><p class='noindent'>
</p>
<h5 class='subsubsectionHead' id='directives'><span class='titlemark'>2.2.1 </span> <a id='x1-160002.2.1'></a>Directives</h5>
<!-- l. 5 --><p class='noindent'>Directives are comments included at the beginning of a job script that set the shell and the options for
the job scheduler. The shebang directive is always the first line of a script. In your job script, this
directive sets which shell your script’s commands will run in. On “Speed”, we recommend that your
script use a shell from the <span class='cmtt-10'>/encs/bin </span>directory.<br class='newline' />
</p><!-- l. 12 --><p class='indent'> To use the <span class='cmtt-10'>tcsh </span>shell, start your script with <span class='obeylines-h'><span class='verb'><span class='cmtt-10'>#!/encs/bin/tcsh</span></span></span>. For <span class='cmtt-10'>bash</span>, start with
<span class='obeylines-h'><span class='verb'><span class='cmtt-10'>#!/encs/bin/bash</span></span></span>.<br class='newline' />
</p><!-- l. 15 --><p class='indent'> Directives that start with <span class='obeylines-h'><span class='verb'><span class='cmtt-10'>#SBATCH</span></span></span> set the options for the cluster’s SLURM job scheduler. The
following provides an example of some essential directives:
</p>
<pre class='verbatim' id='verbatim-7'>
#SBATCH --job-name=<jobname> ## or -J. Give the job a name
#SBATCH --mail-type=<type> ## set type of email notifications
#SBATCH --chdir=<directory> ## or -D, set working directory for the job
#SBATCH --nodes=1 ## or -N, node count required for the job
#SBATCH --ntasks=1 ## or -n, number of tasks to be launched
#SBATCH --cpus-per-task=<corecount> ## or -c, core count requested, e.g. 8 cores
#SBATCH --mem=<memory> ## assign memory for this job,
## e.g., 32G memory per node
</pre>
<!-- l. 28 --><p class='nopar'>
</p><!-- l. 31 --><p class='noindent'>Replace the following to adjust the job script for your project(s) </p>
<ul class='itemize1'>
<li class='itemize'><span class='obeylines-h'><span class='verb'><span class='cmtt-10'><jobname></span></span></span> with a job name for the job. This name will be displayed in the job queue.
</li>
<li class='itemize'><span class='obeylines-h'><span class='verb'><span class='cmtt-10'><directory></span></span></span> with the full path to your job’s working directory, i.e., where your code and
source files reside and where the standard output files will be written. By default, <span class='obeylines-h'><span class='verb'><span class='cmtt-10'>--chdir</span></span></span>
sets the current directory as the job’s working directory.
</li>
<li class='itemize'><span class='obeylines-h'><span class='verb'><span class='cmtt-10'><type></span></span></span> with the type of e-mail notifications you wish to receive. Valid options are: NONE,
BEGIN, END, FAIL, REQUEUE, ALL.
</li>
<li class='itemize'><span class='obeylines-h'><span class='verb'><span class='cmtt-10'><corecount></span></span></span> with the degree of multithreaded parallelism (i.e., cores) allocated to your
job. Up to 32 by default.
</li>
<li class='itemize'><span class='obeylines-h'><span class='verb'><span class='cmtt-10'><memory></span></span></span> with the amount of memory, in GB, that you want to be allocated per node.
Up to 500 depending on the node.<br class='newline' /><span class='cmbx-10'>Note</span>: All jobs MUST set a value for the <span class='cmtt-10'>--mem </span>option.</li></ul>
<!-- l. 44 --><p class='noindent'>Example with short option equivalents:
</p>
<pre class='verbatim' id='verbatim-8'>
#SBATCH -J myjob ## Job’s name set to ’myjob’
#SBATCH --mail-type=ALL ## Receive all email type notifications
#SBATCH -D ./ ## Use current directory as working directory
#SBATCH -N 1 ## Node count required for the job
#SBATCH -n 1 ## Number of tasks to be launched
#SBATCH -c 8 ## Request 8 cores
#SBATCH --mem=32G ## Allocate 32G memory per node
</pre>
<!-- l. 54 --><p class='nopar'>
</p><!-- l. 57 --><p class='noindent'><span class='cmbx-10'>Tip: </span>If you are unsure about memory footprints, err on the side of assigning a generous memory
space to your job, so that it does not get prematurely terminated. You can refine <span class='cmtt-10'>--mem </span>values
for future jobs by monitoring the size of a job’s active memory space on <span class='cmtt-10'>speed-submit</span>
with:
</p>
<pre class='verbatim' id='verbatim-9'>
sacct -j <jobID>
sstat -j <jobID>
</pre>
<!-- l. 65 --><p class='nopar'>
</p><!-- l. 67 --><p class='noindent'>This can be customized to show specific columns:
</p>
<pre class='verbatim' id='verbatim-10'>
sacct -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <jobID>
sstat -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j <jobID>
</pre>
<!-- l. 72 --><p class='nopar'>
</p><!-- l. 74 --><p class='noindent'>Memory-footprint efficiency values (<span class='cmtt-10'>seff</span>) are also provided for completed jobs in the final email
notification as “maxvmsize”. <span class='cmti-10'>Jobs that request a low-memory footprint are more likely to load on a
busy cluster.</span><br class='newline' />
</p><!-- l. 79 --><p class='noindent'>Other essential options are <span class='cmtt-10'>--time</span>, or <span class='cmtt-10'>-t</span>, and <span class='cmtt-10'>--account</span>, or <span class='cmtt-10'>-A</span>. </p>
<ul class='itemize1'>
<li class='itemize'><span class='cmtt-10'>--time=<time> </span>– is the estimate of wall clock time required for your job to run. As
previously mentioned, the maximum is 7 days for batch and 24 hours for interactive jobs.
Jobs with a smaller <span class='cmtt-10'>time </span>value will have a higher priority and may result in your job
being scheduled sooner.
</li>
<li class='itemize'><span class='cmtt-10'>--account=<name> </span>– specifies the Account (aka project or association) to which the Speed
resources used by your job should be attributed. During the move from GE to SLURM,
most users were assigned to Speed’s two default accounts, <span class='cmtt-10'>speed1 </span>and <span class='cmtt-10'>speed2</span>. However,
users who belong to a particular research group or project have a default account such as
<span class='cmtt-10'>aits</span>, <span class='cmtt-10'>vidpro</span>, <span class='cmtt-10'>gipsy</span>, <span class='cmtt-10'>ai2</span>, <span class='cmtt-10'>mpackir</span>, or <span class='cmtt-10'>cmos</span>, among others.</li></ul>
<!-- l. 467 --><p class='noindent'>
</p>
<h5 class='subsubsectionHead' id='working-with-modules'><span class='titlemark'>2.2.2 </span> <a id='x1-170002.2.2'></a>Working with Modules</h5>
<!-- l. 470 --><p class='noindent'>After setting the directives in your job script, the next section typically involves loading the necessary
software modules. The <span class='cmtt-10'>module </span>command is used to manage the user environment; make sure to load
all the modules your job depends on. You can check available modules with the <span class='cmtt-10'>module avail</span>
command. Loading the correct modules ensures that your environment is properly set up for
execution.<br class='newline' />
</p><!-- l. 476 --><p class='noindent'>To list available modules for a particular program (<span class='cmtt-10'>matlab</span>, for example):
</p>
<pre class='verbatim' id='verbatim-11'>
module avail
module -t avail matlab ## show the list for a particular program (e.g., matlab)
module -t avail m ## show the list for all programs starting with m
</pre>
<!-- l. 483 --><p class='nopar'>
</p><!-- l. 486 --><p class='noindent'>For example, insert the following in your script to load the <span class='cmtt-10'>matlab/R2023a </span>module:
</p>
<pre class='verbatim' id='verbatim-12'>
module load matlab/R2023a/default
</pre>
<!-- l. 489 --><p class='nopar'>
</p><!-- l. 491 --><p class='noindent'><span class='cmbx-10'>Note: </span>you can remove a module from active use by replacing <span class='cmtt-10'>load </span>by <span class='cmtt-10'>unload</span>.<br class='newline' />
</p><!-- l. 494 --><p class='noindent'>To list loaded modules:
</p>
<pre class='verbatim' id='verbatim-13'>
module list
</pre>
<!-- l. 497 --><p class='nopar'>
</p><!-- l. 499 --><p class='noindent'>To purge all software in your working environment:
</p>
<pre class='verbatim' id='verbatim-14'>
module purge
</pre>
<!-- l. 502 --><p class='nopar'>
</p><!-- l. 507 --><p class='noindent'>
</p>
<h5 class='subsubsectionHead' id='user-scripting'><span class='titlemark'>2.2.3 </span> <a id='x1-180002.2.3'></a>User Scripting</h5>
<!-- l. 5 --><p class='noindent'>The final part of the job script involves the commands that will be executed by the job. This section
should include all necessary commands to set up and run the tasks your script is designed to perform.
You can use any Linux command in this section, ranging from a simple executable call to a complex
loop iterating through multiple commands.<br class='newline' />
</p><!-- l. 10 --><p class='noindent'><span class='cmbx-10'>Best Practice</span>: prefix any compute-heavy step with <span class='cmtt-10'>srun</span>. This ensures you gain proper insights into
the execution of your job.<br class='newline' />
</p><!-- l. 13 --><p class='noindent'>Each software program may have its own execution framework; it is the script author’s (e.g., your)
responsibility to review the software’s documentation to understand its requirements. Your script
should be written to clearly specify the location of input and output files and the degree of parallelism
needed.<br class='newline' />
</p><!-- l. 17 --><p class='noindent'>Jobs that involve multiple interactions with data input and output files, should make use of <span class='cmtt-10'>TMPDIR</span>, a
scheduler-provided workspace nearly 1 TB in size. <span class='cmtt-10'>TMPDIR </span>is created on the local disk of the compute
node at the start of a job, offering faster I/O operations compared to shared storage (provided over
NFS).
</p><!-- l. 22 --><p class='indent'> A sample job script using <span class='cmtt-10'>TMPDIR </span>is available at <span class='cmtt-10'>/home/n/nul-uge/templateTMPDIR.sh</span>: the job
is instructed to change to <span class='tctt-1000'>$</span><span class='cmtt-10'>TMPDIR</span>, to make the new directory <span class='cmtt-10'>input</span>, to copy data from
<span class='tctt-1000'>$</span><span class='cmtt-10'>SLURM_SUBMIT_DIR/references/ </span>to <span class='cmtt-10'>input/ </span>(<span class='tctt-1000'>$</span><span class='cmtt-10'>SLURM_SUBMIT_DIR </span>represents the current working
directory), to make the new directory <span class='cmtt-10'>results</span>, to execute the program (which takes input from
<span class='tctt-1000'>$</span><span class='cmtt-10'>TMPDIR/input/ </span>and writes output to <span class='tctt-1000'>$</span><span class='cmtt-10'>TMPDIR/results/</span>), and finally to copy the total end results
to an existing directory, <span class='cmtt-10'>processed</span>, that is located in the current working directory. <span class='cmtt-10'>TMPDIR </span>only
exists for the duration of the job, though, so it is very important to copy relevant results from it at
job’s end.
</p><!-- l. 36 --><p class='noindent'>
</p>
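<p class='noindent'>The <span class='cmtt-10'>TMPDIR </span>pattern above can be sketched as follows. This is a local simulation for
illustration only: it creates its own temporary directories and uses <span class='cmtt-10'>tr </span>as a stand-in for the real
compute step, whereas in an actual job <span class='cmtt-10'>TMPDIR </span>and <span class='cmtt-10'>SLURM_SUBMIT_DIR </span>are provided by the
scheduler:
</p>
<pre class='verbatim'>
# Local simulation of the TMPDIR workflow (illustrative only).
# In a real job script, TMPDIR and SLURM_SUBMIT_DIR are set by SLURM.
TMPDIR=$(mktemp -d)
SLURM_SUBMIT_DIR=$(mktemp -d)
mkdir -p "$SLURM_SUBMIT_DIR/references" "$SLURM_SUBMIT_DIR/processed"
echo "sample data" > "$SLURM_SUBMIT_DIR/references/data.txt"

cd "$TMPDIR"
mkdir input results
cp "$SLURM_SUBMIT_DIR"/references/* input/          # stage input on fast local disk
tr 'a-z' 'A-Z' < input/data.txt > results/data.out  # stand-in for the compute step
cp -r results "$SLURM_SUBMIT_DIR/processed/"        # copy results back before job end
cat "$SLURM_SUBMIT_DIR/processed/results/data.out"
</pre>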
<h4 class='subsectionHead' id='sample-job-script'><span class='titlemark'>2.3 </span> <a id='x1-190002.3'></a>Sample Job Script</h4>
<!-- l. 39 --><p class='noindent'>Here’s a basic job script, <a class='url' href='tcsh.sh'><span class='cmtt-10'>tcsh.sh</span></a> shown in Figure <a href='#-source-code-for-tcshsh'>4<!-- tex4ht:ref: fig:tcsh.sh --></a>. You can copy it from our <a href='https://github.com/NAG-DevOps/speed-hpc'>GitHub
repository</a>.
</p>
<figure class='figure' id='-source-code-for-tcshsh'>
<a id='x1-190104'></a>
<!-- l. 43 --><pre class='lstinputlisting' id='listing-1'><span class='label'><a id='x1-19002r1'></a></span><span style='color:#000000'><span class='cmitt-10'>#</span></span><span style='color:#000000'><span class='cmitt-10'>!/</span></span><span style='color:#000000'><span class='cmitt-10'>encs</span></span><span style='color:#000000'><span class='cmitt-10'>/</span></span><span style='color:#000000'><span class='cmitt-10'>bin</span></span><span style='color:#000000'><span class='cmitt-10'>/</span></span><span style='color:#000000'><span class='cmitt-10'>tcsh</span></span>
<span class='label'><a id='x1-19003r2'></a></span>
<span class='label'><a id='x1-19004r3'></a></span><span style='color:#000000'><span class='cmitt-10'>#</span></span><span style='color:#000000'><span class='cmitt-10'>SBATCH</span></span><span style='color:#000000'> <span class='cmitt-10'>--job-name=tcsh-test</span>
</span><span class='label'><a id='x1-19005r4'></a></span><span style='color:#000000'><span class='cmitt-10'>#</span></span><span style='color:#000000'><span class='cmitt-10'>SBATCH</span></span><span style='color:#000000'> <span class='cmitt-10'>--mem=1G</span>
</span><span class='label'><a id='x1-19006r5'></a></span>
<span class='label'><a id='x1-19007r6'></a></span><span style='color:#000000'><span class='cmtt-10'>sleep</span></span><span style='color:#000000'> <span class='cmtt-10'>30</span>
</span><span class='label'><a id='x1-19008r7'></a></span><span style='color:#000000'><span class='cmtt-10'>module</span></span><span style='color:#000000'> <span class='cmtt-10'>load gurobi/8.1.0</span>
</span><span class='label'><a id='x1-19009r8'></a></span><span style='color:#000000'><span class='cmtt-10'>module</span></span><span style='color:#000000'> <span class='cmtt-10'>list</span></span>
</pre>
<figcaption class='caption'><span class='id'>Figure 4: </span><span class='content'>Source code for <a class='url' href='tcsh.sh'><span class='cmtt-10'>tcsh.sh</span></a></span></figcaption><!-- tex4ht:label?: x1-190104 -->
</figure>
<!-- l. 48 --><p class='noindent'>The first line is the shell declaration (also known as a shebang) and sets the shell to <span class='cmti-10'>tcsh</span>. The lines that
begin with <span class='cmtt-10'>#SBATCH </span>are directives for the scheduler. </p>
<ul class='itemize1'>
<li class='itemize'><span class='cmtt-10'>-J </span>(or <span class='cmtt-10'>--job-name</span>) sets <span class='cmti-10'>tcsh-test </span>as the job name.
</li>
<li class='itemize'><span class='cmtt-10'>--mem=1G </span>requests and assigns 1GB of memory to the job. Jobs require the <span class='cmtt-10'>--mem </span>option
to be set either in the script or on the command line; <span class='cmbx-10'>if it’s missing, job submission
will be rejected.</span></li></ul>
<!-- l. 59 --><p class='noindent'>The script then:
</p><ol class='enumerate1'>
<li class='enumerate' id='x1-19012x1'>Sleeps on a node for 30 seconds.
</li>
<li class='enumerate' id='x1-19014x2'>Uses the <span class='cmtt-10'>module </span>command to load the <span class='cmtt-10'>gurobi/8.1.0 </span>environment.
</li>
<li class='enumerate' id='x1-19016x3'>Prints the list of loaded modules into a file.</li></ol>
<!-- l. 66 --><p class='noindent'>The scheduler command, <span class='cmtt-10'>sbatch</span>, is used to submit (non-interactive) jobs. From an ssh session on
“speed-submit”, submit this job with
</p>
<pre class='verbatim' id='verbatim-15'>
sbatch ./tcsh.sh
</pre>
<!-- l. 71 --><p class='nopar'>
</p><!-- l. 73 --><p class='noindent'>You will see a message such as <span class='cmtt-10'>Submitted batch job 2653</span>, where \(2653\) is the job ID assigned. The
commands <span class='cmtt-10'>squeue </span>and <span class='cmtt-10'>sinfo </span>can be used to look at the status of the cluster:
</p>
<pre class='verbatim' id='verbatim-16'>
[serguei@speed-submit src] % squeue -l
Thu Oct 19 11:38:54 2023
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
2641 ps interact b_user RUNNING 19:16:09 1-00:00:00 1 speed-07
2652 ps interact a_user RUNNING 41:40 1-00:00:00 1 speed-07
2654 ps tcsh-tes serguei RUNNING 0:01 7-00:00:00 1 speed-07
[serguei@speed-submit src] % sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
ps* up 7-00:00:00 14 drain speed-[08-10,12,15-16,20-22,30-32,35-36]
ps* up 7-00:00:00 1 mix speed-07
ps* up 7-00:00:00 7 idle speed-[11,19,23-24,29,33-34]
pg up 1-00:00:00 1 drain speed-17
pg up 1-00:00:00 3 idle speed-[05,25,27]
pt up 7-00:00:00 7 idle speed-[37-43]
pa up 7-00:00:00 4 idle speed-[01,03,25,27]
</pre>
<!-- l. 96 --><p class='nopar'>
</p><!-- l. 99 --><p class='noindent'><span class='cmbx-10'>Remember </span>that you only have 30 seconds before the job is essentially over, so if you do not see a
similar output, either adjust the sleep time in the script, or execute the <span class='cmtt-10'>squeue </span>statement more
quickly. The <span class='cmtt-10'>squeue </span>output listed above shows that your job 2654 is running on node <span class='cmtt-10'>speed-07</span>, and
its time limit is 7 days, etc.<br class='newline' />
</p><!-- l. 110 --><p class='indent'> Once the job finishes, there will be a new file in the directory the job was started from,
named <span class='cmtt-10'>slurm-<job id>.out</span>; in this example, the file is <a class='url' href='slurm-2654.out'><span class='cmtt-10'>slurm-2654.out</span></a>.
This file contains the standard output (and error, if any) of the job in question.
If you look at the contents of your newly created file, you will see that it contains the
output of the <span class='cmtt-10'>module list </span>command. Important information is often written to this
file.
</p>
<h4 class='subsectionHead' id='common-job-management-commands-summary'><span class='titlemark'>2.4 </span> <a id='x1-200002.4'></a>Common Job Management Commands Summary</h4>
<!-- l. 125 --><p class='noindent'>Here is a summary of useful job management commands for handling various aspects of job
submission and monitoring on the Speed cluster:
</p>
<ul class='itemize1'>
<li class='itemize'>
<!-- l. 129 --><p class='noindent'>Submitting a job:
</p>
<pre class='verbatim' id='verbatim-17'>
sbatch -A <ACCOUNT> -t <MINUTES> --mem=<MEMORY> -p <PARTITION> ./<myscript>.sh
</pre>
<!-- l. 133 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 136 --><p class='noindent'>Checking your job(s) status:
</p>
<pre class='verbatim' id='verbatim-18'>
squeue -u <ENCSusername>
</pre>
<!-- l. 140 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 143 --><p class='noindent'>Displaying cluster status:
</p>
<pre class='verbatim' id='verbatim-19'>
squeue
</pre>
<!-- l. 147 --><p class='nopar'> </p>
<ul class='itemize2'>
<li class='itemize'>Use <span class='cmtt-10'>-A </span>for per account (e.g., <span class='cmtt-10'>-A vidpro</span>, <span class='cmtt-10'>-A aits</span>),
</li>
<li class='itemize'>Use <span class='cmtt-10'>-p </span>for per partition (e.g., <span class='cmtt-10'>-p ps</span>, <span class='cmtt-10'>-p pg</span>, <span class='cmtt-10'>-p pt</span>), etc.</li></ul>
</li>
<li class='itemize'>
<!-- l. 154 --><p class='noindent'>Displaying job information:
</p>
<pre class='verbatim' id='verbatim-20'>
squeue --job <job-ID>
</pre>
<!-- l. 158 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 161 --><p class='noindent'>Displaying individual job steps: (to see which step failed if you used <span class='cmtt-10'>srun</span>)
</p>
<pre class='verbatim' id='verbatim-21'>
squeue -las
</pre>
<!-- l. 165 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 168 --><p class='noindent'>Monitoring job and cluster status: (view <span class='cmtt-10'>sinfo </span>and watch the queue for your job(s))
</p>
<pre class='verbatim' id='verbatim-22'>
watch -n 1 "sinfo -Nel -pps,pt,pg,pa && squeue -la"
</pre>
<!-- l. 172 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 175 --><p class='noindent'>Canceling a job:
</p>
<pre class='verbatim' id='verbatim-23'>
scancel <job-ID>
</pre>
<!-- l. 179 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 182 --><p class='noindent'>Holding a job:
</p>
<pre class='verbatim' id='verbatim-24'>
scontrol hold <job-ID>
</pre>
<!-- l. 186 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 189 --><p class='noindent'>Releasing a job:
</p>
<pre class='verbatim' id='verbatim-25'>
scontrol release <job-ID>
</pre>
<!-- l. 193 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 196 --><p class='noindent'>Getting job statistics: (including useful metrics like “maxvmem”)
</p>
<pre class='verbatim' id='verbatim-26'>
sacct -j <job-ID>
</pre>
<!-- l. 200 --><p class='nopar'>
</p><!-- l. 203 --><p class='noindent'><span class='cmtt-10'>maxvmem </span>is one of the more useful stats that you can elect to display as a format
option.
</p>
<pre class='verbatim' id='verbatim-27'>
% sacct -j 2654
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2654 tcsh-test ps speed1 1 COMPLETED 0:0
2654.batch batch speed1 1 COMPLETED 0:0
2654.extern extern speed1 1 COMPLETED 0:0
% sacct -j 2654 -o jobid,user,account,MaxVMSize,Reason%10,TRESUsageOutMax%30
JobID User Account MaxVMSize Reason TRESUsageOutMax
------------ --------- ---------- ---------- ---------- ----------------------
2654 serguei speed1 None
2654.batch speed1 296840K energy=0,fs/disk=1975
2654.extern speed1 296312K energy=0,fs/disk=343
</pre>
<!-- l. 219 --><p class='nopar'>
</p><!-- l. 222 --><p class='noindent'>See <span class='cmtt-10'>man sacct </span>or <span class='cmtt-10'>sacct -e </span>for details of the available formatting options. You can define your
preferred default format in the <span class='cmtt-10'>SACCT_FORMAT </span>environment variable in your <span class='cmtt-10'>.cshrc </span>or <span class='cmtt-10'>.bashrc</span>
files.
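</p><p class='noindent'>For instance, <span class='cmtt-10'>bash </span>users might add the following to <span class='cmtt-10'>~/.bashrc </span>(the column list shown
is only an illustrative choice; pick your own from <span class='cmtt-10'>sacct -e</span>). <span class='cmtt-10'>tcsh </span>users would use <span class='cmtt-10'>setenv </span>in
<span class='cmtt-10'>~/.cshrc </span>instead:
</p>
<pre class='verbatim'>
# default columns for sacct (bash syntax, for ~/.bashrc)
export SACCT_FORMAT="jobid,jobname%15,account,state,exitcode,maxvmsize"
echo "$SACCT_FORMAT"
</pre>
<p class='noindent'>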
</p></li>
<li class='itemize'>
<!-- l. 226 --><p class='noindent'>Displaying job efficiency: (including CPU and memory utilization)
</p>
<pre class='verbatim' id='verbatim-28'>
seff <job-ID>
</pre>
<!-- l. 230 --><p class='nopar'>
</p><!-- l. 233 --><p class='noindent'>Do not execute it on <span class='cmtt-10'>RUNNING </span>jobs (only on completed/finished jobs), otherwise the efficiency
statistics may be misleading. If you define the following directive in your batch script, your
GCS ENCS email address will receive an email with <span class='cmtt-10'>seff</span>’s output when your job is
finished.
</p>
<pre class='verbatim' id='verbatim-29'>
#SBATCH --mail-type=ALL
</pre>
<!-- l. 241 --><p class='nopar'>
</p><!-- l. 244 --><p class='noindent'>Output example:
</p>
<pre class='verbatim' id='verbatim-30'>
Job ID: XXXXX
Cluster: speed
User/Group: user1/user1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:04:29
CPU Efficiency: 0.35% of 21:32:20 core-walltime
Job Wall-clock time: 05:23:05
Memory Utilized: 2.90 GB
Memory Efficiency: 2.90% of 100.00 GB
</pre>
<!-- l. 258 --><p class='nopar'></p></li></ul>
<!-- l. 265 --><p class='noindent'>
</p>
<h4 class='subsectionHead' id='advanced-sbatch-options'><span class='titlemark'>2.5 </span> <a id='x1-210002.5'></a>Advanced <span class='cmtt-10'>sbatch </span>Options</h4>
<!-- l. 269 --><p class='noindent'>In addition to the basic sbatch options presented earlier, there are several advanced options that are
generally useful:
</p>
<ul class='itemize1'>
<li class='itemize'>
<!-- l. 273 --><p class='noindent'>E-mail notifications:
</p>
<pre class='verbatim' id='verbatim-31'>
--mail-type=<TYPE>
</pre>
<!-- l. 276 --><p class='nopar'> Requests the scheduler to send an email when the job changes state. <span class='cmtt-10'><TYPE> </span>can be <span class='cmtt-10'>ALL</span>, <span class='cmtt-10'>BEGIN</span>,
<span class='cmtt-10'>END</span>, or <span class='cmtt-10'>FAIL</span>. Mail is sent to the default address,
</p>
<pre class='verbatim' id='verbatim-32'>
<ENCSusername>@encs.concordia.ca
</pre>
<!-- l. 283 --><p class='nopar'> which you can consult via <a class='url' href='webmail.encs.concordia.ca'><span class='cmtt-10'>webmail.encs.concordia.ca</span></a> (use VPN from off-campus) unless a
different address is supplied (see, <span class='cmtt-10'>--mail-user</span>). The report sent when a job ends includes job
runtime, as well as the maximum memory value hit (<span class='cmtt-10'>maxvmem</span>).
</p>
<pre class='verbatim' id='verbatim-33'>
--mail-user <alternate email address>
</pre>
<!-- l. 292 --><p class='nopar'> Specifies a different email address for notifications rather than the default.
</p></li>
<li class='itemize'>
<!-- l. 295 --><p class='noindent'>Export environment variables used by the script:
</p>
<pre class='verbatim' id='verbatim-34'>
--export=ALL
--export=NONE
--export=VARIABLES
</pre>
<!-- l. 300 --><p class='nopar'>
</p></li>
<li class='itemize'>
<!-- l. 302 --><p class='noindent'>Job runtime:
</p>
<pre class='verbatim' id='verbatim-35'>