diff --git a/README.md b/README.md index 8fd9cf3..a687602 100644 --- a/README.md +++ b/README.md @@ -34,6 +34,8 @@ $ mdbook serve --open - Jaspar Martin - Yuki Kume - Osman Haji +- Duc Thanh Vinh Nguyen +- Linton Charles ## Code of Conduct, License & Contributing diff --git a/src/.chapter7/challenges.md b/src/.chapter7/challenges.md new file mode 100644 index 0000000..9358534 --- /dev/null +++ b/src/.chapter7/challenges.md @@ -0,0 +1 @@ +# Challenges diff --git a/src/SUMMARY.md b/src/SUMMARY.md index 78a89db..7e95777 100644 --- a/src/SUMMARY.md +++ b/src/SUMMARY.md @@ -2,15 +2,17 @@ [Welcome](home.md) -- [Getting Started](./chapter1/getting-started.md) +- [Installation & Set-up](./chapter1/getting-started.md) - [GitHub](./chapter1/github.md) - [Windows](./chapter1/windows.md) - [Mac](./chapter1/mac.md) - [Linux](./chapter1/linux.md) - [WSL](./chapter1/wsl.md) + - [M3 MASSIVE](./chapter1/m3.md) + - [Nectar Cloud](./chapter1/nectar.md) - [Challenges](./chapter1/challenges.md) -- [Brief Introduction to C](./chapter2/intro-to-c.md) +- [Intro to C](./chapter2/intro-to-c.md) - [Hello World](./chapter2/helloworld.md) - [Compilation](./chapter2/compilation.md) - [Types & Variables](./chapter2/vars.md) @@ -20,34 +22,55 @@ - [Control Flow](./chapter2/ctrl-flow.md) - [Loops](./chapter2/loops.md) - [Functions](./chapter2/functions.md) - - [Pointers](./chapter2/pointers.md) - - [Dynamic Memory](./chapter2/memory.md) - - [Structures](./chapter2/structs.md) - - [Macros & The Preprocessor](./chapter2/macros.md) - [Challenges](./chapter2/challenges.md) -- [M3](./chapter3/chapter3.md) - - [Getting Started](./chapter3/start.md) - - [Logging In](./chapter3/login.md) - - [Linux Commands](./chapter3/linux-cmds.md) - - [M3's Shared Filesystem](./chapter3/shared-fs.md) - - [Software and Tooling](./chapter3/software-tooling.md) - - [Bash Scripts](./chapter3/bash.md) - - [Job batching & SLURM](./chapter3/slurm.md) - - [Strudel](./chapter3/strudel.md) +- [Operating Systems](./chapter3/chapter3.md) + - [Computer Architecture](./chapter3/computer-architecture.md) + - [Pointers & Memory](./chapter3/memory-pointers.md) + - [Intro to Linux](./chapter3/linux-intro.md) + - [Threading & Concurrency](./chapter3/threads-concurrency.md) + - [Processes](./chapter3/processes.md) + - [Scheduling Algorithms](./chapter3/scheduling.md) - [Challenges](./chapter3/challenges.md) -- [Parallel Computing](./chapter4/chapter4.md) - - [What is Parallel Computing?](./chapter4/parallel-computing.md) - - [Multithreading](./chapter4/multithreading.md) - - [OpenMP](./chapter4/openmp.md) +- [More C](./chapter4/chapter4.md) + - [Dynamic Memory](./chapter4/memory.md) + - [Structures](./chapter4/structs.md) + - [Macros & The Preprocessor](./chapter4/macros.md) + - [System Calls](./chapter4/syscalls.md) + - [Spawning Processes & Threads](./chapter4/spawn-procs.md) - [Challenges](./chapter4/challenges.md) -- [Distributed Computing](./chapter5/chapter5.md) - - [Refresher on Parallelism](./chapter5/parallel-refresher.md) - - [What is Distributed Computing](./chapter5/distributed-computing.md) - - [Message Passing](./chapter5/message-passing.md) - - [OpenMPI](./chapter5/openmpi.md) +- [M3 & SLURM](./chapter5/chapter5.md) + + - [Batch Processing vs. 
Cloud Computing](./chapter5/batch-cloud.md) + - [Parallel & Distributed Computing](./chapter5/parallel-distributed.md) + - [M3 Login - SSH & Strudel](./chapter5/login.md) + - [Intro to SLURM](./chapter5/slurm_intro.md) + - [M3 Interface & Usage](./chapter5/m3-interface.md) + - [Software & Tooling](./chapter5/software-tooling.md) - [Challenges](./chapter5/challenges.md) -[Acknowledgements](./acknowledgements.md) +- [Introduction to Parallel Computing](./chapter6/chapter6.md) + - [Multithreading](./chapter6/multithreading.md) + - [Synchronisation](./chapter6/synchronisation.md) + - [Locks](./chapter6/locks.md) + - [Message Passing](./chapter6/message-passing.md) + - [Challenges](./chapter6/challenges.md) + +- [Parallelisation of Algorithms](./chapter7/chapter7.md) + - [Parallel Search](./chapter7/parallel-search.md) + - [Parallel Sort](./chapter7/parallel-sort.md) + - [Other Parallel Algorithms](./chapter7/other-parallel-algos.md) + - [Machine Learning & HPC](./chapter7/machine-learning-and-hpc.md) + - [Optimisation Algorithms](./chapter7/optim-algos.md) + - [Challenges](./chapter7/challenges.md) + +- [Apache Spark](./chapter8/chapter8.md) + - [Installation & Cluster Set-up](./chapter8/set-up.md) + - [Internal Architecture](./chapter8/internals.md) + - [Data Processing](./chapter8/data-processing.md) + - [Job Batching](./chapter8/job-batching.md) + - [Challenges](./chapter8/challenges.md) + +[Acknowledgements](./acknowledgements.md) \ No newline at end of file diff --git a/src/acknowledgements.md b/src/acknowledgements.md index 77f88ba..c5ad9fb 100644 --- a/src/acknowledgements.md +++ b/src/acknowledgements.md @@ -8,6 +8,8 @@ This book is part of Monash DeepNeurons collection of technical information and - [Osman Haji](https://github.com/Ozzywap) - [Yuki Kume](https://github.com/UnciaBit) - [Jaspar Martin](https://github.com/jasparm) +- [Duc Thanh Vinh Nguyen](https://github.com/VincentNguyenDuc) +- [Linton Charles](https://github.com/linton2000) ## Contributors diff --git a/src/chapter1/aaf.png b/src/chapter1/aaf.png new file mode 100644 index 0000000..836a38d Binary files /dev/null and b/src/chapter1/aaf.png differ diff --git a/src/chapter1/hpcid.png b/src/chapter1/hpcid.png new file mode 100644 index 0000000..97ef6ae Binary files /dev/null and b/src/chapter1/hpcid.png differ diff --git a/src/chapter1/join_project.png b/src/chapter1/join_project.png new file mode 100644 index 0000000..f50e537 Binary files /dev/null and b/src/chapter1/join_project.png differ diff --git a/src/chapter3/start.md b/src/chapter1/m3.md similarity index 77% rename from src/chapter3/start.md rename to src/chapter1/m3.md index ab22ab8..226a963 100644 --- a/src/chapter3/start.md +++ b/src/chapter1/m3.md @@ -1,14 +1,19 @@ -# Getting Started +# M3 MASSIVE + +MASSIVE (Multi-modal Australian ScienceS Imaging and Visualisation Environment) is an HPC supercomputing cluster that you will have access to as an MDN member. On this page we will set you up with access before you learn how to use it in Chapter 5. Feel free to go through the docs to learn about the [hardware config](https://docs.massive.org.au/M3/m3users.html) of M3 (3rd version of MASSIVE) and its [institutional governance](https://massive.org.au/about/about.html#governance). ## Request an account -In order to access M3, you will need to request an account. To do this, follow this link: [HPC ID](https://hpc.erc.monash.edu.au/karaage/aafbootstrap). This should take you to a page this this: +In order to access M3, you will need to request an account. 
To do this, follow this link: [HPC ID](https://hpc.erc.monash.edu.au/karaage/aafbootstrap). This should take you to a page like this: + -![HPC ID](./imgs/aaf.png) +![HPC ID](./aaf.png) Type in Monash, as you can see here. Select Monash University, and tick the Remember my organisation box down the bottom. Once you continue to your organisation, it will take you to the Monash Uni SSO login page. You will need to login with your Monash credentials. -You should now see something like this: ![HPC ID System](./imgs/hpcid.png) +You should now see something like this: + +![HPC ID System](./hpcid.png) Once you are here, there are a couple things you will need to do. The first, and most important is to set your HPC password. This is the password you will use to login to M3. To do this, go to home, then click on Change Linux Password. This will take you through the steps of setting your password. @@ -16,7 +21,9 @@ Once you have done this, you can move on to requesting access to the MDN project ## Add to project -To request to join the MDN project, again from the Home page click on Join Exiting Project. You should see a screen like this: ![Join Project](./imgs/join_project.png) +To request to join the MDN project, again from the Home page click on Join Existing Project. You should see a screen like this: + +![Join Project](./join_project.png) In the text box type `vf38` and click search. This is the project code for MDN. Then select the project and click submit. You will now have to wait for the project admins to approve your request. Once they have done this, you will be able to access the project. This should not take longer than a few days, and you will get an email telling you when you have access. @@ -47,4 +54,4 @@ cat ~/.ssh/id_ed25519.pub Then, go to your github account, go to settings, and click on the SSH and GPG keys tab. Click on New SSH key, and paste the key into the box. Give it a name, and click Add SSH key. -You should now be able to clone repos using SSH. To do this, go to the repo you want to clone, but instead of copying the HTTP link, copy the SSH link, and then its regular git cloning. +You should now be able to clone repos using SSH. To do this, go to the repo you want to clone, but instead of copying the HTTP link, copy the SSH link, and then it's regular git cloning. \ No newline at end of file diff --git a/src/chapter1/nectar-login.png b/src/chapter1/nectar-login.png new file mode 100644 index 0000000..8f4fed0 Binary files /dev/null and b/src/chapter1/nectar-login.png differ diff --git a/src/chapter1/nectar.md b/src/chapter1/nectar.md new file mode 100644 index 0000000..53624a1 --- /dev/null +++ b/src/chapter1/nectar.md @@ -0,0 +1,16 @@ +# Nectar Cloud + +The ARDC Nectar Research Cloud (Nectar) is Australia’s national research cloud, specifically designed for research computing. Like with M3, we will set you up with access now before you learn about it in later chapters. [This webpage](https://ardc.edu.au/services/ardc-nectar-research-cloud/) explains what it is if you're curious. + +## Connect Monash Account to Nectar Cloud +To create an [identity](https://medium.com/@ciente/identity-and-access-management-iam-in-cloud-computing-2777481525a4) (account) in Nectar Cloud, all you have to do is log in using your Monash student account. Click [this link](https://dashboard.rc.nectar.org.au) to access Nectar's landing page. + +You will see the following. Make sure to click "Login via AAF (Australia)". 
+ +![nectar](./nectar-login.png) + +You will be redirected to enter your Monash credentials after which you will see the Nectar Cloud dashboard for your trial project (your project name will be pt-xxxxx). + +## Cloud Starter Series + +ARDC has provided [this cloud starter tutorial series](https://tutorials.rc.nectar.org.au/cloud-starter/01-overview) for people new to Nectar Cloud. You should be able to follow these tutorials using your trial project. If you need more SUs (service units, a.k.a. cloud credits) in order to provision more cloud resources for MDN-related work, you should message your HPC Lead with that request. \ No newline at end of file diff --git a/src/chapter2/challenges.md b/src/chapter2/challenges.md index e2e1edf..a09491b 100644 --- a/src/chapter2/challenges.md +++ b/src/chapter2/challenges.md @@ -16,8 +16,6 @@ The challenges for this chapter can found in the [HPC Training Challenges](https - [Challenge 4 - GCD \& LCM](#challenge-4---gcd--lcm) - [Challenge 5 - Bitwise Add](#challenge-5---bitwise-add) - [Challenge 6 - Bitwise Multiply](#challenge-6---bitwise-multiply) - - [Challenge 7 - Sum and Product Algorithms](#challenge-7---sum-and-product-algorithms) - - [Challenge 8 - Array Concatenation](#challenge-8---array-concatenation) ## Challenge 1 - Hello World @@ -44,13 +42,3 @@ For this challenge you have to implement a function called `bitwise_add()` which This challenge is similar to the last but instead of implementing `+` you must implement `*` (product). Your implementation should be contained in a function called `bitwise_multiply()`. You can use any bitwise or conditional operators. > Note: If you need `+` you can reimplement it internally in `bitwise_multiply` based on your solution from the previous challenge, import it to a header in this challenges folder and include it or copy it to this folder. Ask a trainer if you get stuck with this. - -## Challenge 7 - Sum and Product Algorithms - -This challenge involves implementing the sum and product reductions on an array or memory block of integers. As a bonus challenge, try and make the algorithms more generic and work with any binary operator. - -## Challenge 8 - Array Concatenation - -In this challenge you have to implement an array concatenation function. This should join two arrays of the same type into a single array, similar to `strcat()`. You will need to allocate a new block of memory and in order to store the concatenated arrays which will requires the sizes of the two input arrays to be known by the function. This function should return a pointer to the resulting array. - -> Note: The type of the array this function concatenates can be any type except `char`. diff --git a/src/chapter2/printing.md b/src/chapter2/printing.md index 196c36b..004b5ff 100644 --- a/src/chapter2/printing.md +++ b/src/chapter2/printing.md @@ -42,8 +42,21 @@ int main() > Question: Notice how we used `double` for the type of `sum`. What would happen if `sum` type was `int`? -If you want to have a play with `printf()`, copy the following code snippet run it on your own device. The command will be identically to 'Hello World!'. +If you want to have a play with `printf()`, copy the following code snippet and run it on your own device. The command line will output different varieties of 'Hello World!'. 
+```c +#include <stdio.h> + +int main() { + printf("%30s\n", "Hello World!"); // Padding added + printf("%40s%10s%20s%15s\n", "Hell", "o", "World", "!"); + printf("%10.7s\n", "Hello World!"); // Print only the first 7 characters with padding + printf("%100c%c%c%c%c%c%c%c%c%c%c%c\n", + 72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100, 33); // Decimal ASCII codes for 'Hello World!' + return 0; +} + +``` ### Formatting Specification You'll notice we used a different character after the `%` for each argument. This is because `printf()` needs to know the type of the incoming arguments so that it can format the string appropriately. For example floating point types have to use a decimal point when transformed into a text format while integers do not. diff --git a/src/chapter2/stdlib.md b/src/chapter2/stdlib.md deleted file mode 100644 index 80b1028..0000000 --- a/src/chapter2/stdlib.md +++ /dev/null @@ -1 +0,0 @@ -# The Standard Library diff --git a/src/chapter3/bash.md b/src/chapter3/bash.md deleted file mode 100644 index aada975..0000000 --- a/src/chapter3/bash.md +++ /dev/null @@ -1,42 +0,0 @@ -# Bash Scripts - -Bash is both a command line interface and a scripting language. Linux commands are generally using Bash. Bash scripts are a series of commands that are executed in order. Bash scripts are useful for automating tasks that you do often, or for running a series of commands that you don't want to type out every time. In our case, Bash scripts are used for running jobs on M3. - -In terms of use, Bash can encapsulate any command you would normally run in the terminal into a script that can be easily reused. For example you could have a script that automatically navigates to a directory and activates a virtual environment, or a script that compiles and runs a C program. - -The basic syntax of a bash file is as follows: - -```bash -#!/bin/bash - -# This is a comment - -echo "Hello World" -``` - -We can save this file as `hello.sh` and run it using the following command: `source hello.sh`. This will print `Hello World` to the terminal. - -Let's walk through the file. The first line is `#!/bin/bash`. This is called a shebang, and it tells the system that this file is a bash script. The second line is a comment, and is ignored by the system. The third line is the actual command that we want to run. In this case, we are using the `echo` command to print `Hello World` to the terminal. - -Bash can do a lot more, including basic arithmetic, if statements, loops, and functions, however these are not really necesary for what we are doing. If you want to learn more about bash, you can find a good tutorial [here](https://linuxconfig.org/bash-scripting-tutorial). - -For our use, the main things we need to be able to do are to run executables and files. We do this the exact same way as if manually running them in the terminal. For example, if we want to run a python script, we can do the following: - -```bash -#!/bin/bash - -# This will run hello.py using the python3 executable -python3 hello.py -``` - -If we want to compile and then run a C program, we can do the following: - -```bash -#!/bin/bash - -# This will compile hello.c and then run it -gcc hello.c -o hello -./hello -``` - -Using bash scripts not only saves a lot of time and effort, but it also makes it easier to run jobs on M3 using SLURM. We will go over how to do this in the next section. 
diff --git a/src/chapter3/challenges.md b/src/chapter3/challenges.md index 39f5e4f..a31dd95 100644 --- a/src/chapter3/challenges.md +++ b/src/chapter3/challenges.md @@ -1,45 +1,3 @@ -# M3 Challenges +# Challenges -## Challenge 1 - -Navigate to your scratch directory and, using vim (or your chosen in-terminal editor) create a file called `hello.txt` that contains the text "Hello World". Once you have created the file, use the `cat` command to print the contents of the file to the screen. - -## Challenge 2 - -Write a bash script that prints the contents of the above hello.txt file to the screen and run it locally (on your login node). - -## Challenge 3 - -Submit the above script to the queue by writing another SLURM bash script. Check the status of the job using `squeue`. Once the job has finished, check the output using `cat`. You can find the output file in the directory you submitted the job from. - -## Challenge 4 - -Request an interactive node and attach to it. Once you have done this, install python 3.7 using conda. - -## Challenge 5 - -Clone and run [this](./dl_on_m3/alexnet_stl10.py) script. You will need to first install the dependencies for it. You don't need to wait for it to finish, just make sure it is working. You will know its working if it starts listing out the loss and accuracy for each epoch. You can stop it by pressing `ctrl + c`. - -Once you have confirmed that it is working, deactivate and delete the conda environment, and then end the interactive session. - -> Hint: I have included the dependencies and their versions (make sure you install the right version) in the `requirements.txt` file. You will need python 3.7 to run this script. - -## Challenge 6 - -Go back to the login node. Now you are going to put it all together. Write a bash script that does the following: - -- (1) requests a compute node -- (2) installs python using conda -- (3) clones and runs the above script - -Let this run fully. Check the output of the script to make sure it ran correctly. Does it match the output of the script you ran in challenge 5? -> Hint: You can check the output of the script at any time by `cat`ing the output file. The script does not need to have finished running for you to do this. - -## Challenge 7 - -Edit your submission script so that you get a gpu node, and run the script using the gpu. -> Hint: Use the m3h partition - -## Challenge 8 - -Now you want to clean up your working directory. First, push your solutions to your challenges repo. Then, delete the challenges directory, as well as the conda environment you created in challenge 6. +![under-const](../imgs/under-const.gif) \ No newline at end of file diff --git a/src/chapter3/chapter3.md b/src/chapter3/chapter3.md index 16097dd..90783b0 100644 --- a/src/chapter3/chapter3.md +++ b/src/chapter3/chapter3.md @@ -1,7 +1,9 @@ -# M3 +# Operating Systems -[M3](https://docs.massive.org.au/M3/index.html) is part of [MASSIVE](https://https://www.massive.org.au/), which is a High Performance Computing facility for Australian scientists and researchers. Monash University is a partner of MASSIVE, and provides as majority of the funding for it. M3 is made up of multiple different types of servers, with a total of 5673 cores, 63.2TB of RAM, 5.6PB of storage, and 1.7 million CUDA cores. +A decent chunk of HPC involves using low-level tools and techniques to find optimisations and make software run faster. 
The main reason we use C is that it gives us access to deeper parts of the computer that are normally hidden away and managed on your behalf by your Python or Java interpreter. -M3 utilises the [Slurm](https://slurm.schedmd.com/) workload manager, which is a job scheduler that allows users to submit jobs to the cluster. We will learn a bit more about this later on. +![comp-levels](./imgs/programming-levels.jpg) -This book will take you through the basics of connecting to M3, submitting jobs, transferring data to and from the system and some other things. If you want to learn more about M3, you can read the [M3 documentation](https://docs.massive.org.au/M3/index.html). This will give you a more in-depth look at the system, and how to use it. +> **Note:** Not all low-level machine (assembly) code is faster than high-level code. The primary reason that lower-level code tends to be faster is that it avoids a lot of the overhead (e.g. garbage collection) involved in executing higher-level code. + +If you have done FIT2100 Operating Systems, this chapter will mostly be a refresher for you. It's intended to provide you with a crash course intro to operating systems theory so that you are capable of using low-level tools and implementing things like cache optimisations. \ No newline at end of file diff --git a/src/chapter3/computer-architecture.md b/src/chapter3/computer-architecture.md new file mode 100644 index 0000000..618e9fe --- /dev/null +++ b/src/chapter3/computer-architecture.md @@ -0,0 +1,66 @@ +# Computer Architecture + +Nearly all modern computers are based on the Von Neumann architecture, which was introduced way back in 1945. Historically, there have been two types of computers: + +1. **Fixed Program Computers:** Their function is very specific and they couldn’t be reprogrammed, e.g. calculators. +2. **Stored Program Computers:** These can be programmed to carry out many different tasks; applications are stored on them, hence the name. + +Modern computers are based on a stored-program concept introduced by John Von Neumann. In this stored-program concept, both software programs (code) and data (text, images, metadata, etc.) are stored in a separate storage unit called memory and are treated the same. This novel idea meant that a computer built with this architecture would be much easier to reprogram. + +![basic-arch](./imgs/basic_structure.png) + +### Central Processing Unit (CPU) +Hopefully everyone here knows what a CPU is and what it generally does. We've got a brief description of its components and a lower-level view of its hardware design to give you guys some relevant insight. + +1. **Control Unit:** A control unit directs all input and output flow, fetches code for instructions, and controls how data moves around the system. I think of it as the brains behind the CPU (the CPU of the CPU). +2. **Arithmetic Logic Unit (ALU):** The ALU is the CPU component that handles all the calculations the CPU may need, e.g. addition, subtraction, comparisons. It performs logical operations, bit shifting operations, and arithmetic operations. + +### Overview of Hardware Components +You don't need to remember everything here but as we pursue more hardware-related and low-level projects (Cluster Dev, potentially something in robotics...), these concepts are hopefully a good foundation for that. + +![basic-hardware](./imgs/vn_cpu.png) + +#### Registers +Registers refer to high-speed storage areas in the CPU. They are the most expensive but also the fastest unit of memory. 
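+
+As a quick aside, C exposes a small piece of this to the programmer: the `register` keyword asks for a variable to be kept in a CPU register rather than in main memory. It is only a hint, and modern compilers largely ignore it, but it is a nice minimal sketch of the idea:
+
+```c
+#include <stdio.h>
+
+int main(void) {
+    register int i;    /* hint: keep the loop counter in a CPU register */
+    long long sum = 0;
+    for (i = 0; i < 1000000; i++) {
+        sum += i;      /* note: &i would not compile, register variables have no address */
+    }
+    printf("sum = %lld\n", sum);
+    return 0;
+}
+```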
+The data processed by the CPU are fetched from the registers. There are different types of registers used in this architecture: +- **Accumulator:** Stores the results of calculations made by the ALU. It holds the intermediate results of arithmetic and logical operations, acting as a temporary storage location. +- **Program Counter (PC):** Keeps track of the memory location of the next instruction to be executed. The PC then passes this next address to the Memory Address Register (MAR). +- **Memory Address Register (MAR):** It stores the memory locations of instructions that need to be fetched from memory or stored in memory. +- **Memory Data Register (MDR):** It stores instructions fetched from memory or any data that is to be transferred to, and stored in, memory. +- **Current Instruction Register (CIR):** It stores the most recently fetched instruction while it is waiting to be decoded and executed. +- **Instruction Buffer Register (IBR):** Instructions that are not to be executed immediately are placed in the instruction buffer register (IBR). + +#### Buses +You can think of a bus as a group of tiny wires that carry data (64 tiny wires = a 64-bit bus). Data is transmitted from one part of a computer to another by buses, which connect all major internal components to the CPU and memory. Types: +- **Data Bus:** It carries data among the memory unit, the I/O devices, and the processor. +- **Address Bus:** It carries the address of data (not the actual data) between memory and processor. +- **Control Bus:** It carries control commands from the CPU (and status signals from other devices) in order to control and coordinate all the activities within the computer. + +#### Input/Output Devices +Programs and data are read into main memory from input devices or secondary storage under the control of CPU input instructions. Output devices are used to output information from a computer. If the computer evaluates and stores some results, then with the help of output devices we can present them to the user. + +#### Von Neumann bottleneck +Whatever we do to enhance performance, we cannot get away from the fact that instructions can only be executed one at a time and only sequentially. Both of these factors hold back the performance of the CPU. We can provide a Von Neumann processor with more cache, more RAM, or faster components, but if real gains are to be made in CPU performance then a full hardware upgrade needs to happen. This is what motivates the use of parallel & distributed computing in HPC - it allows us to overcome this limitation without expensive hardware upgrades (details in later chapters). + + +## Memory Hierarchy + +It's very important that you understand these concepts as they go to the heart of memory-based optimisations in HPC. The main purpose behind designing a memory hierarchy is to use [locality of reference](https://en.wikipedia.org/wiki/Locality_of_reference) to minimise access times as much as possible. + +As you can see in the image below, there are multiple levels of memory, with levels closer to the top (closer to the CPU) being smaller, more expensive but much faster than the ones at the bottom. + +![memory-hierarchy](./imgs/Memory-Hierarchy-Design.png) + +We can broadly divide this hierarchy into two: +1. **Primary Memory:** This comprises Main Memory, Cache Memory & CPU registers and is directly accessible by the processor. +2. 
**Secondary Memory:** This is handled by external devices like Magnetic Disk, Optical Disk, and Magnetic Tape, i.e. peripherals which are accessible by the processor via an I/O module and device driver. + +Putting this hierarchy into perspective, we have some indicative stats below for each type of memory. + +| Level | 1 | 2 | 3 | 4 | +| --- | --- | --- | --- | --- | +| Name | Register | Cache | Main Memory | Secondary Memory | +| Size | <1 KB | <16 MB | <16 GB | >100 GB | +| Hardware | D-latches | SRAM | DRAM | Magnetic | +| Access Time | 0.25-0.5 ns | 0.5-25 ns | 80-250 ns | 500,000 ns | +| Bandwidth | 20-100 GB/s | 5-15 GB/s | 1-5 GB/s | 20-150 MB/s | +| Managed by | Compiler | Hardware | Operating System | Operating System | \ No newline at end of file diff --git a/src/chapter3/imgs/Memory-Hierarchy-Design.png b/src/chapter3/imgs/Memory-Hierarchy-Design.png new file mode 100644 index 0000000..7f8dea0 Binary files /dev/null and b/src/chapter3/imgs/Memory-Hierarchy-Design.png differ diff --git a/src/chapter3/imgs/Structure-Of-Linux-Operating-System.png b/src/chapter3/imgs/Structure-Of-Linux-Operating-System.png new file mode 100644 index 0000000..afee076 Binary files /dev/null and b/src/chapter3/imgs/Structure-Of-Linux-Operating-System.png differ diff --git a/src/chapter3/imgs/basic_structure.png b/src/chapter3/imgs/basic_structure.png new file mode 100644 index 0000000..8a35e96 Binary files /dev/null and b/src/chapter3/imgs/basic_structure.png differ diff --git a/src/chapter3/imgs/hpcid.png b/src/chapter3/imgs/hpcid.png deleted file mode 100644 index be747b6..0000000 Binary files a/src/chapter3/imgs/hpcid.png and /dev/null differ diff --git a/src/chapter3/imgs/join_project.png b/src/chapter3/imgs/join_project.png deleted file mode 100644 index 070d055..0000000 Binary files a/src/chapter3/imgs/join_project.png and /dev/null differ diff --git a/src/chapter3/imgs/programming-levels.jpg b/src/chapter3/imgs/programming-levels.jpg new file mode 100644 index 0000000..03d305c Binary files /dev/null and b/src/chapter3/imgs/programming-levels.jpg differ diff --git a/src/chapter3/imgs/vn_cpu.png b/src/chapter3/imgs/vn_cpu.png new file mode 100644 index 0000000..12334ab Binary files /dev/null and b/src/chapter3/imgs/vn_cpu.png differ diff --git a/src/chapter3/linux-cmds.md b/src/chapter3/linux-cmds.md deleted file mode 100644 index 7057ccc..0000000 --- a/src/chapter3/linux-cmds.md +++ /dev/null @@ -1,47 +0,0 @@ -# Linux Commands - -Even if you are already familiar with linux, please read through all of these commands, as some are specific to M3. - -## Basic Linux Commands - -| Command | Function | -| --- | --- | -| `pwd` | prints current directory | -| `ls` | prints list of files / directories in current directory (add a `-a` to list everything, including hidden files/directories | -| `mkdir` | makes a directory | -| `rm ` | deletes *filename*. add `-r` to delete directory. add `-f` to force deletion (be really careful with that one) | -| `cd ` | move directory. | -| `vim` or `nano` | bring up a text editor | -| `cat ` | prints contents of file to terminal | -| `echo` | prints whatever you put after it | -| `chmod ` | changes permissions of file | -| `cp` | copy a file or directory| -| `mv ` | move or rename file or directory | - -> Note: `.` and `..` are special directories. `.` is the current directory, and `..` is the parent directory. These can be used when using any command that takes a directory as an argument. Similar to these, `~` is the home directory, and `/` is the root directory. 
For example, if you wanted to copy something from the parent directory to the home directory, you could do `cp ../ ~/`, without having to navigate anywhere. - -## Cluster Specific Commands - -| Command | Function | Flags -| --- | --- | --- | -| `show_job` | prints information about your jobs | -| `show_cluster` | prints information about the cluster | -| `user_info` | prints information about your account | -| `squeue` | prints information about your jobs | `-u ` to print information about a specific user | -| `sbatch ` | submit a job to the cluster | -| `scontrol show job ` | prints information about a specific job | -| `scancel ` | cancel a job | - -## M3 Specific Commands - -| Command | Function | -| --- | --- | -| `module load ` | load a module | -| `module unload ` | unload a module | -| `module avail` | list available modules | -| `module list` | list loaded modules | -| `module spider ` | search for a module | -| `module help ` | get help for a module | -| `module show ` | show details about a module | -| `module purge` | unload all modules | -| `module swap ` | swap two modules | \ No newline at end of file diff --git a/src/chapter3/linux-intro.md b/src/chapter3/linux-intro.md new file mode 100644 index 0000000..a143ee7 --- /dev/null +++ b/src/chapter3/linux-intro.md @@ -0,0 +1,10 @@ +# Introduction to Linux + +Linux is one of the most popular versions of the UNIX operating system. It is open source: its source code is freely available and it is free to use. Linux was designed with UNIX compatibility in mind, and its functionality is quite similar to that of UNIX. + +The Linux operating system has primarily three components: +- **Kernel:** The kernel is the core part of Linux. It is responsible for all major activities of the operating system. It consists of various modules and it interacts directly with the underlying hardware. The kernel provides the required abstraction to hide low-level hardware details from system and application programs. +- **System Library:** System libraries are special functions or programs through which application programs or system utilities access the kernel's features. These libraries implement most of the functionality of the operating system and do not require the kernel module's code access rights. +- **System Utility:** System utility programs are responsible for specialised, individual-level tasks. + +![linux-struct](./imgs/Structure-Of-Linux-Operating-System.png) \ No newline at end of file diff --git a/src/chapter2/pointers.md b/src/chapter3/memory-pointers.md similarity index 100% rename from src/chapter2/pointers.md rename to src/chapter3/memory-pointers.md diff --git a/src/chapter3/processes.md b/src/chapter3/processes.md new file mode 100644 index 0000000..ce5138d --- /dev/null +++ b/src/chapter3/processes.md @@ -0,0 +1,3 @@ +# Processes + +![under-const](../imgs/under-const.gif) \ No newline at end of file diff --git a/src/chapter3/scheduling.md b/src/chapter3/scheduling.md new file mode 100644 index 0000000..ddc6525 --- /dev/null +++ b/src/chapter3/scheduling.md @@ -0,0 +1,3 @@ +# Scheduling Algorithms + +![under-const](../imgs/under-const.gif) \ No newline at end of file diff --git a/src/chapter3/slurm.md b/src/chapter3/slurm.md deleted file mode 100644 index 8cc0cf5..0000000 --- a/src/chapter3/slurm.md +++ /dev/null @@ -1,70 +0,0 @@ -# Job batching & SLURM - -Launching and running jobs on M3 is controlled by [SLURM](https://slurm.schedmd.com/). 
You don't really need to know a lot about it in order to use it, so this section will take you through the basics of what you will need for what we are doing. - -If you want a complete guide on SLURM in M3, you can find it [here](https://docs.massive.org.au/M3/slurm/slurm-overview.html). - -## Submitting simple jobs - -As we discussed in the previous section we use bash scripts to run jobs on M3. We can submit these jobs using the `sbatch` command. For example, if we have a bash script called `hello.sh` that contains the following: - -```bash -#!/bin/bash - -#SBATCH --ntasks=1 -#SBATCH --mem=1MB -#SBATCH --time=0-00:01:00 -#SBATCH --job-name=hello -#SBATCH --partition=m3i -#SBATCH --mail-user=@student.monash.edu -#SBATCH --mail-type=BEGIN,END,FAIL - -echo "Hello World" -``` - -We can submit this job using the following command: - -`sbatch hello.sh` - -This will submit the job to the queue, and you will get an email when the job starts, finishes, or fails. You can also check the status of your job using the `squeue` command. - -## Options - -You might have noticed the `#SBATCH` lines in the bash script. These are called options, and they tell SLURM how to run the job. The options we used in the example above are: - -- `ntasks`: The number of tasks or processes to run. -- `mem`: The amount of memory to allocate to the job. -- `time`: The maximum amount of time the job can run for. -- `job-name`: The name of the job. Up to 15 characters. -- `partition`: The partition to run the job on. -- `mail-user`: The email address to send job status emails to. -- `mail-type`: The types of emails to send. - -> Note: In the case of M3, a task is essentially the same as a process. This is **not** the same as a cpu core. You can have a task that uses one or multiple cores. You can also have multiple tasks comprising the same job, each with one or multiple cores being utilised. It can get quite confusing, so if you are unsure about what you need, just ask. There is also more information in the M3 docs. - -There are a lot more options that you can use, and you can find a more complete list [here](https://docs.massive.org.au/M3/slurm/simple-batch-jobs.html). - -In particular, if you want to run multithreading or multiprocessing jobs, or you need a gpu, there are more options you need to configure. - -## Interactive jobs - -Sometimes you might want to actually connect to the node that you are running your job on, in order to see what is happening or to set it up before running the job. You can do this using the `smux` command. Similar to regular batch jobs, you can set options when you start the interactive session. An example of this is: - -`smux new-session --ntasks=1 --time=0-00:01:00 --partition=m3i --mem=4GB` - -This will start an interactive session on a node with 1 cpu, 1 minute of time, and 4GB of memory. There are again other options available, and you can find a more complete explanation [here](https://docs.massive.org.au/M3/slurm/interactive-jobs.html). - -### Connecting to interactive jobs - -Typically when you start an interactive job it will not start immediately. Instead, it will be queued up once it has started you will need to connect to it. You can do this by running `smux a`, which will reconnect you to the session. If you want to disconnect from the session but leave it running, you can press `Ctrl + b` followed by `d`. This will disconnect you from the session, but leave it running. You can reconnect to it later using `smux a`. 
If you want to kill the session, if you are connected just run `exit`, otherwise if you are in a login node run `scancel `. You can find the job id using `show_job`. - -## Checking the status of jobs, finding out job IDs, and killing jobs - -A couple of useful commands for general housekeeping are: - -- `squeue`: This will show you the status of all jobs currently running on M3. -- `show_job`: This will show you the status of all jobs you have submitted. -- `squeue -u `: This will show you the status of all jobs submitted by a particular user currently running. -- `scancel `: This will kill a job with a particular job id. -- `scancel -u `: This will kill all jobs submitted by a particular user. -- `show_cluster`: This will show you the status of the cluster, including any nodes that are offline or in maintenance. diff --git a/src/chapter3/strudel.md b/src/chapter3/strudel.md deleted file mode 100644 index 2b34a9f..0000000 --- a/src/chapter3/strudel.md +++ /dev/null @@ -1,31 +0,0 @@ -# Strudel - -STRUDEL is a web application used to connect to M3. There are two main benefits to this over regular ssh. Firstly, you are able to access a desktop session, so you can interact easier with M3, look at graphs, etc.. STRUDEL also enables the use of Jupyter notebooks, which are especially useful for data science and machine learning. - -## Accessing STRUDEL - -First, go to the [STRUDEL](https://beta.desktop.cvl.org.au/) website. You should see something like this: - -![strudel select cvl](imgs/strudel1.png) - -Select the CVL option, and you should be taken to another page, where you choose how to log in. - -![strudel login](imgs/strudel2.png) - -Select AAF. On the next page, search for and select Monash University. - -![AAF Login](imgs/aaf_strudel.png) - -You will now be taken to the Monash login page. Once you have logged in, it will show one last page, asking permission to use your details. Click allow, and you will be taken to the STRUDEL home page. - -![strudel home page](imgs/strudel_home.png) - -## Desktop Session - -To start a desktop session using STRUDEL, click on the **Desktop** tab on the side, select your desired options, and click launch. Once the session has started, you will be able to attach to it by clicking on the connect button in the *Pending / Running Desktops* section. - -## Jupyter Notebooks - -Similar to Desktops, if you want a basic Jupyter notebook, click on the **Jupyter Lab** tab, choose how much compute you want, and click launch. - -If you want to have a more customised Jupyter notebook, you can do this by first sshing into M3, and activate conda. Then activate the conda environment `jupyterlab`. Install you desired packages in this environment. Once you have done this, go back to STRUDEL, and launch a **Jupyter Lab - BYO** session. 
\ No newline at end of file diff --git a/src/chapter3/threads-concurrency.md b/src/chapter3/threads-concurrency.md new file mode 100644 index 0000000..b92708b --- /dev/null +++ b/src/chapter3/threads-concurrency.md @@ -0,0 +1,3 @@ +# Threading & Concurrency + +![under-const](../imgs/under-const.gif) \ No newline at end of file diff --git a/src/chapter4/challenges.md b/src/chapter4/challenges.md index 33b071c..2986a5d 100644 --- a/src/chapter4/challenges.md +++ b/src/chapter4/challenges.md @@ -1,51 +1,11 @@ -# Parallel Computing Challenges +# Challenges -## Overview +## Challenge 1 - Sum and Product Algorithms -- [Parallel Computing Challenges](#parallel-computing-challenges) - - [Overview](#overview) - - [Pre-Tasks](#pre-tasks) - - [Task 1 - Single Cluster Job using OpenMP](#task-1---single-cluster-job-using-openmp) - - [Task 2 - Parallel `for` Loop](#task-2---parallel-for-loop) - - [Task 3 - Parallel Reductions](#task-3---parallel-reductions) - - [Task 4 - Laplace Equation for Calculating the Temperature of a Square Plane](#task-4---laplace-equation-for-calculating-the-temperature-of-a-square-plane) - - [Task 5 - Calculate Pi using "Monte Carlo Algorithm"](#task-5---calculate-pi-using-monte-carlo-algorithm) +This challenge involves implementing the sum and product reductions on an array or memory block of integers. As a bonus challenge, try to make the algorithms more generic and work with any binary operator. -## Pre-Tasks +## Challenge 2 - Array Concatenation -Make sure to clone a copy of **your** challenges repo onto M3, ideally in a personal folder on vf38_scratch. +In this challenge you have to implement an array concatenation function. This should join two arrays of the same type into a single array, similar to `strcat()`. You will need to allocate a new block of memory in order to store the concatenated arrays, which requires the sizes of the two input arrays to be known by the function. This function should return a pointer to the resulting array. -> Note: For every challenge you will be running the programs as SLURM jobs. This is so we don't overload the login nodes. A template [SLURM job script](./job.slurm) is provided at the root of this directory which you can use to submit your own jobs to SLURM by copying it to each challenges sub-directory and filling in the missing details. You may need more than one for some challenges. This template will put the would-be-printed output in a file named `slurm-.out`. - -## Task 1 - Single Cluster Job using OpenMP - -Create a program in `hello.c` that prints 'Hello, world from thread: ' to the output. Launch the job to a node SLURM. - -> Note: - -> - The output of a job is put in a slurm-.out file by default. -> - The template slurm job scripts will output the results to a `slurm-.out` file. - -## Task 2 - Parallel `for` Loop - -In `array-gen.c` implement a program that generates an array containing the numbers 0..10'000 elements (inclusive) using a `for` loop. Measure the execution time using the `time` Linux command. Now reimplement the program to utilise OpenMP's parallel `for` loop macros, measuring the execution time again. Is there any performance improvement? Are the elements still in the correct order and if not how can you fix this. Try experimenting with different sized arrays and element types. - -> Hint: You will likely need to allocate memory from the heap. - -## Task 3 - Parallel Reductions - -In the C chapter we created a sum program that summed the elements of an array together. 
Using this as a base, create a new program that again computes the sum of the elements of an array but using OpenMP, comparing the execution time between the sequential and parallel versions. Is there any performance improvement? How would using a different binary operator change our ability to parallelize the the reduction? - -If you have time, implement the sum but at each iteration, raise the current value to the power of the current accumulation divide by 100, adding this to the accumulation. Test a serial and parallel version. Is the parallel any faster? - -> Note: `module load gcc` to use newer version of gcc if you have error with something like `-std=c99`. - -## Task 4 - Laplace Equation for Calculating the Temperature of a Square Plane - -For this challenge you will attempt to parallelize an existing implementation of the Laplace Equation. Throughout the source files of this project there are various loops you can try and make faster by utilizing OpenMP macros. See if you can make a faster version in the `laplace2d-parallel.c`. To build these files make sure you're in that directory and use the command `make`. The executables will be in the same directory. - -## Task 5 - Calculate Pi using "Monte Carlo Algorithm" - -For this challenge you will have to try and implement the Monte Carlo algorithm with no framework or template and using everything you've learnt so far. Good luck. - -[Short explanation of Monte Carlo algorithm](https://www.youtube.com/watch?v=7ESK5SaP-bc&ab_channel=MarbleScience) +> Note: The type of the array this function concatenates can be any type except `char`. \ No newline at end of file diff --git a/src/chapter4/chapter4.md b/src/chapter4/chapter4.md index 00f7e45..27cf36f 100644 --- a/src/chapter4/chapter4.md +++ b/src/chapter4/chapter4.md @@ -1,3 +1,3 @@ -# Parallel Computing +# More C -In this chapter we discuss parallel computing and its uses in developing fast applications. We then look at how OpenMP allows us to parallelize or code to make it faster. +This chapter will walk you through the more intermediate features of the C language. It aims to pair the theoretical knowledge of operating systems you gained in the last chapter with the practical skills to actually use it. You will learn about other C language constructs, memory allocation, system calls (the Kernel's API) and actually spawning processes & threads. 
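+
+As a small preview of where this chapter ends up, here is a minimal sketch of spawning a process via system calls (assuming a POSIX system; `fork()`, `getpid()` and `wait()` are covered properly in the later sections):
+
+```c
+#include <stdio.h>
+#include <sys/wait.h>   /* wait() */
+#include <unistd.h>     /* fork(), getpid(), pid_t */
+
+int main(void) {
+    pid_t pid = fork();                  /* system call: duplicate the current process */
+    if (pid == 0) {
+        printf("child  pid=%d\n", (int)getpid());   /* this branch runs in the child */
+    } else {
+        wait(NULL);                      /* the parent waits for the child to finish */
+        printf("parent pid=%d\n", (int)getpid());
+    }
+    return 0;
+}
+```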
\ No newline at end of file diff --git a/src/chapter4/imgs/4 Parallel Computing OpenMP.gif b/src/chapter4/imgs/4 Parallel Computing OpenMP.gif deleted file mode 100644 index 1006c16..0000000 Binary files a/src/chapter4/imgs/4 Parallel Computing OpenMP.gif and /dev/null differ diff --git a/src/chapter4/imgs/Distributed Memory Architecture.png b/src/chapter4/imgs/Distributed Memory Architecture.png deleted file mode 100644 index 31b04a5..0000000 Binary files a/src/chapter4/imgs/Distributed Memory Architecture.png and /dev/null differ diff --git a/src/chapter4/imgs/Hybrid Parallel Programming.png b/src/chapter4/imgs/Hybrid Parallel Programming.png deleted file mode 100644 index 2de418b..0000000 Binary files a/src/chapter4/imgs/Hybrid Parallel Programming.png and /dev/null differ diff --git a/src/chapter4/imgs/Monte Carlo.png b/src/chapter4/imgs/Monte Carlo.png deleted file mode 100644 index 0b08413..0000000 Binary files a/src/chapter4/imgs/Monte Carlo.png and /dev/null differ diff --git a/src/chapter4/imgs/Parallel Computing Example.png b/src/chapter4/imgs/Parallel Computing Example.png deleted file mode 100644 index a5b662b..0000000 Binary files a/src/chapter4/imgs/Parallel Computing Example.png and /dev/null differ diff --git a/src/chapter4/imgs/Pasted image 20230325113147.png b/src/chapter4/imgs/Pasted image 20230325113147.png deleted file mode 100644 index ec3a879..0000000 Binary files a/src/chapter4/imgs/Pasted image 20230325113147.png and /dev/null differ diff --git a/src/chapter4/imgs/Pasted image 20230325113254.png b/src/chapter4/imgs/Pasted image 20230325113254.png deleted file mode 100644 index 375a898..0000000 Binary files a/src/chapter4/imgs/Pasted image 20230325113254.png and /dev/null differ diff --git a/src/chapter4/imgs/Pasted image 20230325113303.png b/src/chapter4/imgs/Pasted image 20230325113303.png deleted file mode 100644 index be44f48..0000000 Binary files a/src/chapter4/imgs/Pasted image 20230325113303.png and /dev/null differ diff --git a/src/chapter4/imgs/Pasted image 20230325113312.png b/src/chapter4/imgs/Pasted image 20230325113312.png deleted file mode 100644 index 3c98d3e..0000000 Binary files a/src/chapter4/imgs/Pasted image 20230325113312.png and /dev/null differ diff --git a/src/chapter4/imgs/Pasted image 20230325113329.png b/src/chapter4/imgs/Pasted image 20230325113329.png deleted file mode 100644 index 4fc23d5..0000000 Binary files a/src/chapter4/imgs/Pasted image 20230325113329.png and /dev/null differ diff --git a/src/chapter4/imgs/Pasted image 20230326141615.png b/src/chapter4/imgs/Pasted image 20230326141615.png deleted file mode 100644 index feed862..0000000 Binary files a/src/chapter4/imgs/Pasted image 20230326141615.png and /dev/null differ diff --git a/src/chapter4/imgs/Pasted image 20230326142826.png b/src/chapter4/imgs/Pasted image 20230326142826.png deleted file mode 100644 index 92186c1..0000000 Binary files a/src/chapter4/imgs/Pasted image 20230326142826.png and /dev/null differ diff --git a/src/chapter4/imgs/Running Processes in Parallel.png b/src/chapter4/imgs/Running Processes in Parallel.png deleted file mode 100644 index 8c8d66d..0000000 Binary files a/src/chapter4/imgs/Running Processes in Parallel.png and /dev/null differ diff --git a/src/chapter4/imgs/Shared Memory Architecture.png b/src/chapter4/imgs/Shared Memory Architecture.png deleted file mode 100644 index 1d20303..0000000 Binary files a/src/chapter4/imgs/Shared Memory Architecture.png and /dev/null differ diff --git a/src/chapter4/imgs/Slurm Architecture.png 
b/src/chapter4/imgs/Slurm Architecture.png deleted file mode 100644 index e3547ff..0000000 Binary files a/src/chapter4/imgs/Slurm Architecture.png and /dev/null differ diff --git a/src/chapter4/imgs/Thread vs Processes.png b/src/chapter4/imgs/Thread vs Processes.png deleted file mode 100644 index a1179ea..0000000 Binary files a/src/chapter4/imgs/Thread vs Processes.png and /dev/null differ diff --git a/src/chapter4/imgs/Time Command.png b/src/chapter4/imgs/Time Command.png deleted file mode 100644 index ed459a6..0000000 Binary files a/src/chapter4/imgs/Time Command.png and /dev/null differ diff --git a/src/chapter4/imgs/Top Command.png b/src/chapter4/imgs/Top Command.png deleted file mode 100644 index 713220b..0000000 Binary files a/src/chapter4/imgs/Top Command.png and /dev/null differ diff --git a/src/chapter4/imgs/sbatch Command.png b/src/chapter4/imgs/sbatch Command.png deleted file mode 100644 index feed862..0000000 Binary files a/src/chapter4/imgs/sbatch Command.png and /dev/null differ diff --git a/src/chapter4/imgs/show_cluster Command.png b/src/chapter4/imgs/show_cluster Command.png deleted file mode 100644 index a533f48..0000000 Binary files a/src/chapter4/imgs/show_cluster Command.png and /dev/null differ diff --git a/src/chapter4/imgs/squeue Command.png b/src/chapter4/imgs/squeue Command.png deleted file mode 100644 index 391a782..0000000 Binary files a/src/chapter4/imgs/squeue Command.png and /dev/null differ diff --git a/src/chapter2/macros.md b/src/chapter4/macros.md similarity index 99% rename from src/chapter2/macros.md rename to src/chapter4/macros.md index 2313b2d..dbb0641 100644 --- a/src/chapter2/macros.md +++ b/src/chapter4/macros.md @@ -93,4 +93,4 @@ The final macro worth mentioning is the `#pragma` macro. This macro is special i /// Header's content -``` +``` \ No newline at end of file diff --git a/src/chapter2/memory.md b/src/chapter4/memory.md similarity index 100% rename from src/chapter2/memory.md rename to src/chapter4/memory.md diff --git a/src/chapter4/multithreading.md b/src/chapter4/multithreading.md deleted file mode 100644 index ad0c189..0000000 --- a/src/chapter4/multithreading.md +++ /dev/null @@ -1,65 +0,0 @@ -# Multithreading on HPC - -## Thread vs Process - -![Thread vs Processes](imgs/Thread%20vs%20Processes.png) - -When computer runs a program, your source code is loaded into RAM and process is started. -A **process** is a collection of code, memory, data and other resources. -A process runs in a unique address space. So Two processes can not see each other’s memory. - -A **thread** is a sequence of code that is executed inside the scope of the **process**. You can (usually) have multiple **threads** executing concurrently within the same process. -**Threads** can view the memory (i.e. variables) of other threads within the same process - -A **multiprocessing** system has more than two processors, whereas **multithreading** is a program execution technique that allows a single process to have multiple code segments. - -## Architecture of a HPC Cluster (Massive) - -![Slurm Architecture](imgs/Slurm%20Architecture.png) - -The key in HPC is to write a parallel computing code that utilise multiple nodes at the same time. essentially, more computers faster your application - -## Using Massive - -### Find Available Partition - -Command: -```bash -show_cluster -``` - -![show_cluster Command](imgs/show_cluster%20Command.png) - -Before you run your job, it’s important to check the available resources. 
- -`show_cluster` is a good command to check the available resources such as CPU and Memory. Make sure to also check the status of the of the node, so that your jobs get started without waiting - -### Sending Jobs - -Command: -```bash -#SBATCH --flag=value -``` - -![sbatch Command](imgs/sbatch%20Command.png) - -Here is the example of shell script for running multi-threading job -`#sbatch` specifies resources and then it runs the executable named hello. - -`#sbatch` tasks specifies how many processes to run -Cpus per task is pretty self explanatory, it specifies how many cpu cores you need to run a process, this will be the number of threads used in the job -And make sure to specify which partition you are using - -### Monitor Jobs - -Command: -```bash -squeue -# or -squeue -u -``` - -![squeue Command](imgs/squeue%20Command.png) - -After you submitted your job, you can use the command squeue to monitor your job -you can see the status of your job to check whether it’s pending or running and also how long has it been since the job has started. diff --git a/src/chapter4/openmp.md b/src/chapter4/openmp.md deleted file mode 100644 index 9bcc36a..0000000 --- a/src/chapter4/openmp.md +++ /dev/null @@ -1,122 +0,0 @@ -# Parallel Computing with OpenMP - -## What is OpenMP - -OpenMP, stand for open multi-processing is an API for writing multithreaded applications - -It has a set of compiler directives and library routines for parallel applications, and it greatly simplifies writing multi-threaded code in Fortran, C and C++. - -Just few lines of additional code can make your application parallel  - -OpenMP uses shared memory architecture. It assumes all code runs on a single server - -## Threads - -![Threads Visualisation](imgs/Threads%20Visualisation.png) - -A thread of execution is the smallest instruction that can be managed independently by an operating system. - -In parallel region, multiple threads are spawned and utilises the cores on CPU - -> Only one thread exists in a serial region - -## OpenMP Compiler Directives - -Recall compiler directives in C; particularly the `#pragma` directive. These can be used to create custom functionality for a compiler and enable specialized features in-code. - -`#pragma` is a preprocessor directive that is used to provide additional information to the compiler beyond the standard language syntax. It allows programmers to give hints or directives to the compiler, which the compiler can use to optimize the code or to use specific compiler features or extensions. - -The `#pragma` directive is followed by a keyword that specifies the type of pragma and any additional parameters or options that are needed. For example, the `#pragma omp` directive is used in OpenMP parallel programming to provide hints to the compiler about how to parallelize code. Here are some examples of `#pragma` directives: -- `#pragma once`: This is a commonly used pragma in C and C++ header files to ensure that the header file is included only once in a compilation unit. This can help to prevent errors that can occur when the same header file is included multiple times. -- `#pragma message`: This pragma is used to emit a compiler message during compilation. This can be useful for providing additional information to the programmer or for debugging purposes. -- `#pragma warning`: This pragma is used to control compiler warnings. It can be used to turn specific warnings on or off, or to change the severity of warnings. 
-- `#pragma pack`: This pragma is used to control structure packing in C and C++. It can be used to specify the alignment of structure members, which can affect the size and layout of structures in memory. -- `#pragma optimize`: This pragma is used to control code optimization. It can be used to specify the level of optimization, or to turn off specific optimizations that may be causing problems. - -It is important to note that `#pragma` directives are compiler-specific, meaning that different compilers may interpret them differently or may not support certain directives at all. It is important to check the documentation for a specific compiler to understand how it interprets `#pragma` directives. - -OpenMP provides a set of `#pragma` directives that can be used to specify the parallelization of a particular loop or section of code. For example, the `#pragma omp parallel` directive is used to start a parallel region, where multiple threads can execute the code concurrently. The `#pragma omp for` directive is used to parallelize a loop, with each iteration of the loop being executed by a different thread. - -Here's an example of how `#pragma` directives can be used with OpenMP to parallelize a simple loop: - -```c -#include -#include - -int main() { - int i; - #pragma omp parallel for - for (i = 0; i < 10; i++) { - printf("Thread %d executing iteration %d\n", omp_get_thread_num(), i); - } - return 0; -} -``` - -Use `gcc -fopenmp` to compile your code when you use `#pragma` - -## Compile OpenMP - -1. Add `#include if you are using OpenMP function` -2. Run `gcc -fopenmp -o hello hello.c` - -## How it works - -![OpenMP and Directive](imgs/OpenMP%20and%20Directive.png) -[Source](https://www.researchgate.net/figure/OpenMP-API-The-master-thread-is-indicated-with-T-0-while-inside-the-parallel-region_fig3_329536624 -) - -Here is an example of `#pragma` -- The function starts with serial region -- At the line `#pragma omp parallel`, a group of threads are spawned to create parallel region inside the bracket -- At the end of the bracket, the program goes back to serial computing - -## Running "Hello World" on Multi-threads - ->If you're unsure about the difference between **multi-threading** and **multi-processing**, check the page [here](multithreading.md) - -**Drawing in Serial (Left) vs Parallel (Right)** -![](imgs/4%20Parallel%20Computing%20OpenMP.gif) - -Drawing in serial versus drawing in parallel, you can see how we can place one pixel at a time and take a long time to make the drawing, but on the right hand side if we choose to load and place four pixels down simultaneously we can get the picture faster, however during the execution it can be hard to make out what the final image will be, given we don’t know what pixel will be placed where in each execution step. - -Now this is obviously a fairly abstract analogy compared to exactly what’s happening under the hood, however if we go back to the slide diagram containing zones of multiple threads and serial zones, some parts of a program must be serial as if this program went further and drew a happy face and then a frown face, drawing both at the same time is not useful to the program, yes it would be drawn faster but the final image won’t make sense or achieve the goal of the program. - -## How many threads? You can dynamically change it - -**`omp_set_num_threads()` Library Function** -Value is set inside program. 
Need to recompile program to change - -**`OMP_NUM_THREADS` Environment Variable** - -```bash -export OMP_NUM_THREADS=4 -./hello -``` - -The operating system maps the threads to available hardware. You would not normally want to exceed the number of cores/processors available to you. - -## Measuring Performance - -The command `top` or `htop` looks into a process. As you can see from the image on right, it shows the CPU usages. - -![Top Command](imgs/Top%20Command.png) - -The command `time` checks the overall performance of the code. - -![Time Command](imgs/Time%20Command.png) - -By running this command, you get real time, user time and system time. - -**Real** is wall clock time - time from start to finish of the call. This includes the time of overhead - -**User** is the amount of CPU time spent outside the kernel within the process - -**Sys** is the amount of CPU time spent in the kernel within the process. -**User** time + **Sys** time will tell you how much actual CPU time your process used. - -## More Features of OpenMP - -- [YouTube Video: Introduction to OpenMP](https://www.youtube.com/watch?v=iPb6OLhDEmM&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG&index=11 ) -- [YouTube Video: Data environment -\#pragma omp parallel private](https://www.youtube.com/watch?v=dlrbD0mMMcQ&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG&index=17) -- [YouTube Video: Parallel Loops - \#omp parallel for reduction()](https://www.youtube.com/watch?v=iPb6OLhDEmM&list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG&index=11 ) diff --git a/src/chapter4/parallel-computing.md b/src/chapter4/parallel-computing.md deleted file mode 100644 index 92836b9..0000000 --- a/src/chapter4/parallel-computing.md +++ /dev/null @@ -1,41 +0,0 @@ -# Introduction to Parallel Computing - -## What is Parallel Computing? - -Parallel computing is about executing the instructions of the program simultaneously - -One of the core values of computing is the breaking down of a big problem into smaller easier to solve problems, or at least smaller problems. - -In some cases, the steps required to solve the problem can be executed simultaneously (in parallel) rather than sequentially (in order) - -A supercomputer is not just about fast processors. It is multiple processors working together in simultaneously. Therefore it makes sense to utilise parallel computing in the HPC environment, given the access to large numbers of processors - -![Running Processes in Parallel](imgs/Running%20Processes%20in%20Parallel.png) - -An example of parallel computing looks like this. - -![Parallel Computing Example](imgs/Parallel%20Computing%20Example.png) - -Here there is an array which contains numbers from 0 to 999. The program is to increment each values by 1. Comparing serial code on left and parallel code on right, parallel code is utilising 4 cores of a CPU. Therefore, it can expect approximately 4 times speed up from just using 1 core, what we are seeing here is how the same code can in-fact execute faster as four times as many elements can be updated in the same time one would be. - -## Parallel Computing Memory Architectures - -Parallel computing has various memory architectures - -### Shared Memory Architecture: - -There is shared memory architectures where multiple CPUs runs on the same server. OpenMP uses this model - -![Shared Memory Architecture](imgs/Shared%20Memory%20Architecture.png) - -### Distributed Memory Architecture: - -This distributed memory architecture where CPU and memory are bundled together and works by communicating with other nodes. 
Message passing protocol called lMPI is used in this model
-
-![Distributed Memory Architecture](imgs/Distributed%20Memory%20Architecture.png)
-
-### Hybrid Parallel Programming:
-
-For High Performance Computing (HPC) applications, OpenMP is combined with MPI. This is often referred to as Hybrid Parallel Programming.
-
-![Hybrid Parallel Programming](imgs/Hybrid%20Parallel%20Programming.png)
\ No newline at end of file
diff --git a/src/chapter4/spawn-procs.md b/src/chapter4/spawn-procs.md
new file mode 100644
index 0000000..13aa1d0
--- /dev/null
+++ b/src/chapter4/spawn-procs.md
@@ -0,0 +1,3 @@
+# Spawning Processes & Threads
+
+![under-const](../imgs/under-const.gif)
\ No newline at end of file
diff --git a/src/chapter4/structs.md b/src/chapter4/structs.md
new file mode 100644
index 0000000..ff6d028
--- /dev/null
+++ b/src/chapter4/structs.md
@@ -0,0 +1,30 @@
+# Structures
+
+So far we have only been able to manipulate primitive data types and collections of a single type, but what if we want to manipulate and store data that is of different types? This is where structures come in. Structures are used to hold data of different types in a compact format. Structures are created using the `struct` keyword paired with a unique name, followed by a brace scope of variable declarations. To then create a variable of the structure type you again use the `struct` keyword and the structure's type name followed by a variable name. You can then initialise the fields using a comma-separated list, enclosed in braces, where each element is the desired value for initialising the corresponding field of the structure. The fields are then accessed using the variable and the member access operator (`.`) paired with the field's name.
+
+```c
+#include <stdio.h>
+
+struct A
+{
+    int i;
+    double d;
+    char* c;
+};
+
+int main()
+{
+    struct A a = { 5, 576.658, "Hello" };
+    printf("%d\n", a.i);
+    printf("%f\n", a.d);
+    printf("%s\n", a.c);
+
+    return 0;
+}
+```
+
+> **Note:**
+>
+> - Structures do not support methods.
+> - Elements in a structure are laid out contiguously, i.e. each element is right next to the one before it.
+> - The size of a structure can be obtained normally using `sizeof`.
diff --git a/src/chapter4/syscalls.md b/src/chapter4/syscalls.md
new file mode 100644
index 0000000..2e229eb
--- /dev/null
+++ b/src/chapter4/syscalls.md
@@ -0,0 +1,3 @@
+# System Calls
+
+![under-const](../imgs/under-const.gif)
\ No newline at end of file
diff --git a/src/chapter5/batch-cloud.md b/src/chapter5/batch-cloud.md
new file mode 100644
index 0000000..6919911
--- /dev/null
+++ b/src/chapter5/batch-cloud.md
@@ -0,0 +1,29 @@
+# Batch Processing vs. Cloud Computing
+
+You are all likely familiar with the definition of High Performance Computing. Here is one from IBM:
+
+> High-performance computing (HPC) is technology that uses clusters of powerful processors that work in parallel to process massive multi-dimensional data sets, also known as big data, and solve complex problems at extremely high speeds. HPC solves some of today’s most complex computing problems in real time.
+
+But the term HPC is not really used much outside the scientific research community. A lot of cloud systems involve a similar scale of hardware, parallel & distributed computing, and similar computational workload, data processing capacity and low latency/high throughput capability to HPC clusters. *So what exactly is the difference between a cloud system and an HPC cluster?*
+
+At the end of the day this comes down to semantics, but a key difference is that an HPC cluster implies a system primarily used for **batch processing**, whereas a cloud system would involve **interactive processing**.
+
+## Key Differences
+
+The vast majority of computer systems, and nearly 100% of the ones that the average person uses, are cloud-based interactive systems. Due to the nature of use cases specific to researchers, batch processing is a much more suitable choice for them.
+
+__Batch Processing:__
+- Jobs (code scripts) submitted are executed at a later time.
+- User can't interact (or only limited interaction).
+- Performance measure is **throughput**.
+- Snapshot of output is used for debugging.
+
+![batch-image](./imgs/batch-processing.jpeg)
+
+__Interactive Processing:__
+- Jobs submitted are executed immediately.
+- User can interact.
+- Performance measure is **response time**.
+- Interactive debugging.
+
+![interactive-image](./imgs/interactive-processing.png)
\ No newline at end of file
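To make the contrast concrete, here is a minimal sketch using generic SLURM commands (introduced later in this chapter); `myjob.slurm` is a hypothetical script name:

```bash
# Batch processing: the job goes into a queue and runs when resources
# become free; you inspect the output file afterwards.
sbatch myjob.slurm

# Interactive processing: you get a live shell on a compute node and
# see results immediately.
srun --pty bash -i
```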
diff --git a/src/chapter5/challenges.md b/src/chapter5/challenges.md
index e3c3cc5..39f5e4f 100644
--- a/src/chapter5/challenges.md
+++ b/src/chapter5/challenges.md
@@ -1,54 +1,45 @@
-# Distributed Computing Challenges
+# M3 Challenges
-## Overview
+## Challenge 1
-- [Distributed Computing Challenges](#distributed-computing-challenges)
-  - [Overview](#overview)
-  - [Pre-Tasks](#pre-tasks)
-  - [Task 1 - Multinode 'Hello, world!'](#task-1---multinode-hello-world)
-  - [Task 2 - Ping Pong](#task-2---ping-pong)
-  - [Task 3 - Multinode Sum](#task-3---multinode-sum)
-  - [Task 4 - Multinode Mergesort](#task-4---multinode-mergesort)
+Navigate to your scratch directory and, using vim (or your chosen in-terminal editor), create a file called `hello.txt` that contains the text "Hello World". Once you have created the file, use the `cat` command to print the contents of the file to the screen.
-## Pre-Tasks
+## Challenge 2
-For each task you will need to load MPICH using Spack from within your SLURM job script. There is a shared installation of Spack and MPICH within `vf38_scratch`. To load Spack and MPICH use the following to commands within you SLURM job script before any other command.
+Write a bash script that prints the contents of the above hello.txt file to the screen and run it locally (on your login node).
-```sh
-. ~/vf38_scratch/spack/share/spack/setup-env.sh
-spack load mpich
-```
+## Challenge 3
-A template SLURM job file is given at the root of the distributed challenges directory. Copy this for each challenge into their respective sub-directories as every challenge will require running a SLURM job. If want to do some more experimenting, create multiple job scripts that use different amounts of nodes and test the execution time.
+Submit the above script to the queue by writing another SLURM bash script. Check the status of the job using `squeue`. Once the job has finished, check the output using `cat`. You can find the output file in the directory you submitted the job from.
-You will also need to generate some input for the sum and mergesort challenges. This can be done by compiling and running the program in `generate.cpp`. Run the following commands to build an generate the inputs for your challenges.
+## Challenge 4
-```sh
-module load gcc/10.2.0
-g++ -std=c++20 -o bin/generate generate.cpp
-bin/generate 1000000000
-```
+Request an interactive node and attach to it.
Once you have done this, install python 3.7 using conda.
-> Note:
->
-> - You do not have to worry about how to read the numbers from the file, this is handled for you already but it is recommended to look at the read function in `read.h` and understand what it is doing.
-> - The expected output of the 'sum' challenge is found in the generated `output.txt` file within the challenges directory.
-> The expected output of the 'mergesort' challenge is found in the generated `sorted.txt` file within the challenges directory however this will contain a lot of values so a check function is provided that compares a resorted version of your input to your sorted output.
-> The sum and mergesort programs you will develop take a number as input. This is the size of the input data that you are performing your programs on. This should be the same number as the one used with the generator program. In the template programs for this challenge they are maked as an pointer to data called `input`.
-> Given the above setup and configuration, the input data will contain ~8GB of data or ~8.0e9 bytes so make sure to allocate enough resources both in the programs an in the SLURM job scripts.
+## Challenge 5
-## Task 1 - Multinode 'Hello, world!'
+Clone and run [this](./dl_on_m3/alexnet_stl10.py) script. You will need to first install the dependencies for it. You don't need to wait for it to finish, just make sure it is working. You will know it's working if it starts listing out the loss and accuracy for each epoch. You can stop it by pressing `ctrl + c`.
-Your first task is to say 'Hello, world!' from different nodes on M3. This involves printing the nodes name, rank (ID) and the total number of nodes in the MPI environment.
+Once you have confirmed that it is working, deactivate and delete the conda environment, and then end the interactive session.
-## Task 2 - Ping Pong
+> Hint: I have included the dependencies and their versions (make sure you install the right version) in the `requirements.txt` file. You will need python 3.7 to run this script.
-For this next task you will play a Ping-Pong game of sorts between two nodes. This will involve passing a count between the two nodes and incrementing the count for each send and receive. This should increment the count to 10 in the end.
+## Challenge 6
-## Task 3 - Multinode Sum
+Go back to the login node. Now you are going to put it all together. Write a bash script that does the following:
-Your next task is to sum the numbers in the generated `input.txt` file together across ten nodes. This will involve summing 1,000,000,000 floats together. The rough expected output is contained in the `output.txt` file. Remember the input array is already given in the template file.
+- (1) requests a compute node
+- (2) installs python using conda
+- (3) clones and runs the above script
-## Task 4 - Multinode Mergesort
+Let this run fully. Check the output of the script to make sure it ran correctly. Does it match the output of the script you ran in challenge 5?
+> Hint: You can check the output of the script at any time by `cat`ing the output file. The script does not need to have finished running for you to do this.
-Your final task is to sort the numbers from the input file `unsorted.txt` using a distributed version of mergesort. This will involve ten nodes running their won mergesorts on chunks of the input data individually and then a final mergesort of the intermediate results. Remember the input array is already given in the template file.
+
+## Challenge 7
+
+Edit your submission script so that you get a GPU node, and run the script using the GPU.
+> Hint: Use the m3h partition
+
+## Challenge 8
+
+Now you want to clean up your working directory. First, push your solutions to your challenges repo. Then, delete the challenges directory, as well as the conda environment you created in challenge 6.
diff --git a/src/chapter5/chapter5.md b/src/chapter5/chapter5.md
index 4d82439..82d33ef 100644
--- a/src/chapter5/chapter5.md
+++ b/src/chapter5/chapter5.md
@@ -1,7 +1,7 @@
-# Distributed Computing
+# M3 & SLURM
-- [Refresher on Parallelism](parallel-refresher.md)
-- [What is Distributed Computing](distributed-computing.md)
-- [OpenMPI](openmpi.md)
-- [Message Passing](message-passing.md)
-- [Challenges](challenges.md)
+[M3](https://docs.massive.org.au/M3/index.html) is part of [MASSIVE](https://www.massive.org.au/), which is a High Performance Computing facility for Australian scientists and researchers. Monash University is a partner of MASSIVE, and provides a majority of the funding for it. M3 is made up of multiple different types of servers, with a total of 5673 cores, 63.2TB of RAM, 5.6PB of storage, and 1.7 million CUDA cores.
+
+M3 utilises the [Slurm](https://slurm.schedmd.com/) workload manager, which is a job scheduler that allows users to submit jobs to the cluster. We will learn a bit more about this later on.
+
+This book will introduce the theory behind HPC clusters and how parallel & distributed computing works on these systems. After this, you will learn how to connect to and use M3, how SLURM works, and how to submit jobs and take advantage of the massive computational capability that M3 provides.
diff --git a/src/chapter5/distributed-computing.md b/src/chapter5/distributed-computing.md
deleted file mode 100644
index 7aa688e..0000000
--- a/src/chapter5/distributed-computing.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# What is Distributed Computing
-
-**Distributed computing is parallel execution on distributed memory architecture.**
-
-This essentially means it is a form of parallel computing, where the processing power is spread across multiple machines in a network rather than being contained within a single system. In this memory architecture, the problems are broken down into smaller parts, and each machine is assigned to work on a specific part.
-
-![distributed memory architecture](imgs/distributed_memory_architecture.png)
-
-## Distributed Memory Architecture
-
-Lets have a look at the distributed memory architecture in more details.
-
-- Each processor has its own local memory, with its own address space
-- Data is shared via a communications network using a network protocol, e.g Transmission Control Protocol (TCP), Infiniband etc..
-
-![Distributed Memory Architecture](imgs/distributed_memory_architecture_2.png)
-
-## Distributed vs Shared program execution
-
-The following diagram provides another way of looking at the differences between distributed and shared memory architecture and their program execution.
-
-![Distributed vs Shared](imgs/distributed_vs_shared.png)
-
-## Advantages of distributed computing
-
-There are number of benefits to distributed computing in particular it addresses some shortcomings of shared memory architecture.
-
-- No contention for shared memory since each machine has its own memory. Compare this to shared memory architecture where all the cpu's are sharing the same memory.
-- Highly scalable as we can add more machines and are not limited by RAM.
-- Effectively resulting in being able to handle large-scale problems - -The benefits above do not come without some drawbacks including network overhead. - -## Disadvantages of distributed computing - -- Network overload. Network can be overloaded by: - - Multiple small messages - - Very large data throughput - - Multiple all-to-all messages ($N^2$ growth of messages) -- Synchronization failures - - Deadlock (processes waiting for an input from another process that never comes) - - Livelock (even worse as it’s harder to detect. All processes shuffling data around but not progressing in the algorithm ) -- More complex software architecture design. - - Can also be combined with threading-technologies as openMP/pthreads for optimal performance. diff --git a/src/chapter3/imgs/aaf.png b/src/chapter5/imgs/aaf.png similarity index 100% rename from src/chapter3/imgs/aaf.png rename to src/chapter5/imgs/aaf.png diff --git a/src/chapter3/imgs/aaf_strudel.png b/src/chapter5/imgs/aaf_strudel.png similarity index 100% rename from src/chapter3/imgs/aaf_strudel.png rename to src/chapter5/imgs/aaf_strudel.png diff --git a/src/chapter3/imgs/auth_strudel.png b/src/chapter5/imgs/auth_strudel.png similarity index 100% rename from src/chapter3/imgs/auth_strudel.png rename to src/chapter5/imgs/auth_strudel.png diff --git a/src/chapter5/imgs/batch-processing.jpeg b/src/chapter5/imgs/batch-processing.jpeg new file mode 100644 index 0000000..b6eb6c9 Binary files /dev/null and b/src/chapter5/imgs/batch-processing.jpeg differ diff --git a/src/chapter3/imgs/filezilla_connect_m3.png b/src/chapter5/imgs/filezilla_connect_m3.png similarity index 100% rename from src/chapter3/imgs/filezilla_connect_m3.png rename to src/chapter5/imgs/filezilla_connect_m3.png diff --git a/src/chapter3/imgs/filezilla_sitemanager.png b/src/chapter5/imgs/filezilla_sitemanager.png similarity index 100% rename from src/chapter3/imgs/filezilla_sitemanager.png rename to src/chapter5/imgs/filezilla_sitemanager.png diff --git a/src/chapter3/imgs/gurobi.png b/src/chapter5/imgs/gurobi.png similarity index 100% rename from src/chapter3/imgs/gurobi.png rename to src/chapter5/imgs/gurobi.png diff --git a/src/chapter3/imgs/gurobi2.png b/src/chapter5/imgs/gurobi2.png similarity index 100% rename from src/chapter3/imgs/gurobi2.png rename to src/chapter5/imgs/gurobi2.png diff --git a/src/chapter5/imgs/interactive-processing.png b/src/chapter5/imgs/interactive-processing.png new file mode 100644 index 0000000..fdfb2e9 Binary files /dev/null and b/src/chapter5/imgs/interactive-processing.png differ diff --git a/src/chapter5/imgs/login-compute-nodes.jpeg b/src/chapter5/imgs/login-compute-nodes.jpeg new file mode 100644 index 0000000..372e700 Binary files /dev/null and b/src/chapter5/imgs/login-compute-nodes.jpeg differ diff --git a/src/chapter5/imgs/parallel-distributed.png b/src/chapter5/imgs/parallel-distributed.png new file mode 100644 index 0000000..2c7b8c2 Binary files /dev/null and b/src/chapter5/imgs/parallel-distributed.png differ diff --git a/src/chapter3/imgs/putty_key_not_cached.png b/src/chapter5/imgs/putty_key_not_cached.png similarity index 100% rename from src/chapter3/imgs/putty_key_not_cached.png rename to src/chapter5/imgs/putty_key_not_cached.png diff --git a/src/chapter3/imgs/putty_start.png b/src/chapter5/imgs/putty_start.png similarity index 100% rename from src/chapter3/imgs/putty_start.png rename to src/chapter5/imgs/putty_start.png diff --git a/src/chapter5/imgs/scale-out-up.png b/src/chapter5/imgs/scale-out-up.png new file mode 100644 
index 0000000..2a9f588
Binary files /dev/null and b/src/chapter5/imgs/scale-out-up.png differ
diff --git a/src/chapter5/imgs/slurm-arch.gif b/src/chapter5/imgs/slurm-arch.gif
new file mode 100644
index 0000000..7f5d7b7
Binary files /dev/null and b/src/chapter5/imgs/slurm-arch.gif differ
diff --git a/src/chapter3/imgs/strudel1.png b/src/chapter5/imgs/strudel1.png
similarity index 100%
rename from src/chapter3/imgs/strudel1.png
rename to src/chapter5/imgs/strudel1.png
diff --git a/src/chapter3/imgs/strudel2.png b/src/chapter5/imgs/strudel2.png
similarity index 100%
rename from src/chapter3/imgs/strudel2.png
rename to src/chapter5/imgs/strudel2.png
diff --git a/src/chapter3/imgs/strudel_home.png b/src/chapter5/imgs/strudel_home.png
similarity index 100%
rename from src/chapter3/imgs/strudel_home.png
rename to src/chapter5/imgs/strudel_home.png
diff --git a/src/chapter5/job-scripting.md b/src/chapter5/job-scripting.md
new file mode 100644
index 0000000..1777b9d
--- /dev/null
+++ b/src/chapter5/job-scripting.md
@@ -0,0 +1 @@
+# Job Scripting
diff --git a/src/chapter3/login.md b/src/chapter5/login.md
similarity index 56%
rename from src/chapter3/login.md
rename to src/chapter5/login.md
index c1b91e5..1e80416 100644
--- a/src/chapter3/login.md
+++ b/src/chapter5/login.md
@@ -1,8 +1,16 @@
-# Logging In
+# Accessing M3 MASSIVE
-First you will need to ssh into a login node in the cluster. You do this by doing the following:
+The M3 MASSIVE supercomputer has a particular kind of distributed computing architecture called the [master-slave architecture](https://en.wikipedia.org/wiki/Master–slave_(technology)). The master (aka driver) node in SLURM & M3 is referred to as the **login node**, while the slave nodes are the **compute nodes**.
-## Windows
+![login & compute nodes](./imgs/login-compute-nodes.jpeg)
+
+To access the M3 cluster's login node you have two options:
+- Use the SSH protocol in a terminal session.
+- Use Strudel (a webapp built on top of SSH) to create a desktop session.
+
+## Login via SSH
+
+### Windows

If you are using windows, the best way to ssh into m3 is by using [puTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html).

@@ -18,7 +26,7 @@ If it the first time accessing M3 from this client then you may see something like this:

Just click Accept, and puTTY will add the cluster's ssh fingerprint to cache.

-## Mac / Linux
+### Mac / Linux

On macOS or linux, ssh is built into the terminal, so just copy the following into your shell, substituting username for your username.

@@ -78,3 +86,35 @@ a ticket for your issue.
 ```
 Once you are done and want to logout, just type `exit`. This will close the connection.
+
+## Login via Strudel
+
+STRUDEL is a web application used to connect to M3. There are two main benefits to this over regular ssh. Firstly, you are able to access a desktop session, so you can interact with M3 more easily, look at graphs, etc. STRUDEL also enables the use of Jupyter notebooks, which are especially useful for data science and machine learning.
+
+### Accessing STRUDEL
+
+First, go to the [STRUDEL](https://beta.desktop.cvl.org.au/) website. You should see something like this:
+
+![strudel select cvl](imgs/strudel1.png)
+
+Select the CVL option, and you should be taken to another page, where you choose how to log in.
+
+![strudel login](imgs/strudel2.png)
+
+Select AAF. On the next page, search for and select Monash University.
+
+![AAF Login](imgs/aaf_strudel.png)
+
+You will now be taken to the Monash login page.
Once you have logged in, it will show one last page, asking permission to use your details. Click allow, and you will be taken to the STRUDEL home page.
+
+![strudel home page](imgs/strudel_home.png)
+
+### Desktop Session
+
+To start a desktop session using STRUDEL, click on the **Desktop** tab on the side, select your desired options, and click launch. Once the session has started, you will be able to attach to it by clicking on the connect button in the *Pending / Running Desktops* section.
+
+### Jupyter Notebooks
+
+Similar to Desktops, if you want a basic Jupyter notebook, click on the **Jupyter Lab** tab, choose how much compute you want, and click launch.
+
+If you want to have a more customised Jupyter notebook, you can do this by first sshing into M3 and activating conda. Then activate the conda environment `jupyterlab`. Install your desired packages in this environment. Once you have done this, go back to STRUDEL, and launch a **Jupyter Lab - BYO** session.
\ No newline at end of file
diff --git a/src/chapter3/shared-fs.md b/src/chapter5/m3-interface.md
similarity index 80%
rename from src/chapter3/shared-fs.md
rename to src/chapter5/m3-interface.md
index f881310..973be5d 100644
--- a/src/chapter3/shared-fs.md
+++ b/src/chapter5/m3-interface.md
@@ -1,4 +1,21 @@
-# M3's Shared Filesystem
+# M3 Interface & Usage
+Along with Slurm, M3 provides an interface with a set of tools, software packages, commands and a way of working with and using its directory structure. We provide a brief overview of this interface below. To get more info, visit the [documentation website](https://docs.massive.org.au).
+
+## M3 Specific Commands
+
+| Command | Function |
+| --- | --- |
+| `module load <module>` | load a module |
+| `module unload <module>` | unload a module |
+| `module avail` | list available modules |
+| `module list` | list loaded modules |
+| `module spider <module>` | search for a module |
+| `module help <module>` | get help for a module |
+| `module show <module>` | show details about a module |
+| `module purge` | unload all modules |
+| `module swap <module1> <module2>` | swap two modules |
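For example, a typical module session on the login node might look like the sketch below; `gcc/10.2.0` is a module version loaded elsewhere in this book, so it is a reasonable but still assumed example:

```bash
module avail gcc        # list the gcc versions installed on M3
module load gcc/10.2.0  # load a specific version
module list             # confirm the module is now loaded
module unload gcc/10.2.0
```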
+## M3's Shared Filesystem

When we talk about a shared filesystem, what we mean is that the filesystem that M3 uses allows multiple users or systems to access, manage, and share files and directories over a network, concurrently. It enables users to collaborate on projects, share resources, and maintain a unified file structure across different machines and platforms. In addition to this, it enables the many different compute nodes in M3 to access data from a single source which users also have access to, simplifying the process of running jobs on M3.

@@ -6,36 +23,36 @@ Very simply, the way it works is that the home, project and scratch directories

M3 has a unique filesystem consisting of three main important parts (for you).

-## Home Directory
+### Home Directory

This is each user's personal directory, which only they have access to. This has a ~10GB allocation, and should store any hidden files, configuration files, or other files that you don't want to share with others. This is backed up nightly.

-## Project Directory
+### Project Directory

This is the shared project directory, for all users in MDN to use. This has a ~1TB allocation, and should be used only for project specific files, scripts, and data. This is also backed up nightly, so in the case that you accidentally delete something important, it can be recovered.

-## Scratch Directory
+### Scratch Directory

This is also shared with all users in MDN, and has more allocation (~3TB). You may use this for personal projects, but keep your usage low. In general it is used for temporary files and larger datasets, and should be used for any files that you don't need to keep for a long time. This is not backed up, so if you delete something, it's gone forever.

-## General Rules
+### General Rules

- Keep data usage to a minimum. If you have a large amount of data, consider moving it to the scratch directory. If it is not necessary to keep it, consider deleting it.
- Keep your home directory clean.
- In general, it is good practice to make a directory in the shared directory for yourself. Name this your username or name, to make it easily identifiable. This is where you should store your files for small projects or personal use.
- The project directory is not for personal use. Do not store files in the project directory that are not related to MDN. Use the scratch directory instead.

-## Copying files to and from M3
+### Copying files to and from M3

Copying files to and from M3 can be done in a few different ways. We will go over the basics of scp, as well as setting up FileZilla. A key thing to remember when copying files to and from M3 is that you shouldn't be using the regular ssh URL. Instead, they have a dedicated SFTP URL to use for file transfers. This is `m3-dtn.massive.org.au`. This is the URL you will use when setting up FileZilla, and when using scp.

-### Using scp
+#### Using scp

You can copy files to M3 using the `scp` command. This is a command line tool that is built into most linux distributions. If you are using Windows, you will need to install a tool like [Git Bash](https://gitforwindows.org/) to use this command.

-#### Linux / Mac
+##### Linux / Mac

To copy a file to M3, use the following command:

@@ -77,3 +94,4 @@ Don't change anything else. Leave password blank for now.
It should look something like this: ![Add M3 as a site](./imgs/filezilla_connect_m3.png)
Click on Connect, and enter your password when prompted. You should now be connected to M3. You can now drag and drop files to and from M3.
+
diff --git a/src/chapter5/message-passing.md b/src/chapter5/message-passing.md
deleted file mode 100644
index f6d8742..0000000
--- a/src/chapter5/message-passing.md
+++ /dev/null
@@ -1,11 +0,0 @@
-# Message Passing
-
-As each processor has its own local memory with its own address space in distributed computing, we need a way to communicate between the processes and share data. Message passing is the mechanism of exchanging data across processes. Each process can communicate with one or more other processes by sending messages over a network.
-
-The MPI (message passing interface) in OpenMPI is a communication protocol standard defining message passing between processors in distributed environments and are implemented by different groups with the main goals being high performance, scalability, and portability.
-
-OpenMPI is one implementation of the MPI standard. It consists of a set of headers library functions that you call from your program. i.e. C, C++, Fortran etc.
-
-For C, you will need a header file for all the functions (mpi.h) and link in the relevant library functions. This is all handled by the mpicc program (or your compiler if you wanted to specify all the paths).
-
-In the next chapter we will look at how to implement message passing using OpenMPI.
diff --git a/src/chapter5/parallel-distributed.md b/src/chapter5/parallel-distributed.md
new file mode 100644
index 0000000..ecca38b
--- /dev/null
+++ b/src/chapter5/parallel-distributed.md
@@ -0,0 +1,55 @@
+# Parallel & Distributed Computing
+
+Nearly all modern computer systems utilise parallel computing to speed up the execution of algorithms. To see how this works in practice, look at the diagram below.
+
+![parallel vs. distributed](imgs/parallel-distributed.png)
+
+As you can see, in a scenario where a program (job) takes 3 seconds and 3 independent jobs have to be executed by a system, doing it serially on a single computer takes a total of 9 seconds. But doing it simultaneously across 3 computers will only take 3 seconds, thus achieving a 3x speedup through parallel computing.
+
+This is the fundamental principle that High Performance Computing is based on.
+
+## What is Distributed Computing?
+
+**Distributed computing is parallel execution on a distributed memory architecture.**
+
+This essentially means it is a form of parallel computing, where the processing power is spread across multiple machines in a network rather than being contained within a single system. In this memory architecture, the problems are broken down into smaller parts, and each machine is assigned to work on a specific part.
+
+![distributed memory architecture](imgs/distributed_memory_architecture.png)
+
+### Distributed Memory Architecture
+
+Let's have a look at the distributed memory architecture in more detail.
+
+- Each processor has its own local memory, with its own address space
+- Data is shared via a communications network using a network protocol, e.g. Transmission Control Protocol (TCP), InfiniBand, etc.
+
+![Distributed Memory Architecture](imgs/distributed_memory_architecture_2.png)
+
+Each machine or **node** is connected to the HPC cluster via a network, typically one with high bandwidth and low latency. The fact that these are largely independent computers connected over a network, rather than a set of CPU/GPU cores in the same computer (as in parallel computing), presents a set of disadvantages.
+
+### Advantages of parallel & local computing:
+- Less **data transfer latency** than a distributed system. The system bus inside a machine has a much lower latency compared to even the fastest computer networks.
+- No need to deal with **network congestion** related issues and complexities involved in setting up, connecting and managing networks.
+- Inter-process communication is **a lot simpler** on a local computer compared to a distributed system (can just use OpenMP instead of MPI).
+
+On the other hand, there is a reason why most modern computer systems - both cloud & HPC - use distributed computing.
+
+## Scaling Up vs Scaling Out
+
+Any time you want to make a computer system more powerful you have two choices - you either scale "up" or you scale "out".
+- __Scale up:__ This is the most intuitive way to increase computational capability. Simply upgrade your system to have more powerful hardware, i.e. increased RAM, CPU frequency, number of cores, etc.
+- __Scale out:__ This is where distributed and parallel computing shines. Instead of upgrading the hardware of existing systems, you just add more computers (nodes) to the system and use distributed computing software to take advantage of the expanded network.
+
+![Scale up vs out](./imgs/scale-out-up.png)
+
+### Advantages of distributed computing (scaling out):
+- It's usually cheaper and **better value for money** to add new, cheap computers (nodes) to an HPC cluster than to replace existing computers with upgraded hardware.
+- Better redundancy and **disaster recovery** when you're relying on more than one computer. Adding more nodes to your system allows you to do things like duplicate data and prepare for system meltdowns.
+- Fewer **I/O bottleneck** issues. In a single computer there are only so many additional CPU cores, RAM and secondary storage you can add before the memory addressing system faces bottlenecks. A distributed computing system has nodes with their own individual processors and memory.
+
+And finally, not everything needs to be done on a parallel or distributed system. There are certain issues with this paradigm (explored in more detail in later chapters) that result in certain advantages for serial computing.
+
+### Advantages of serial computing:
+- **Simpler** to design & implement algorithms. Parallel algorithms can get quite complex, especially when dealing with more complicated instructions with dependencies.
+- **Less overhead** involved in managing a parallel & distributed job. No need to manage data sharing between threads, processes, nodes, etc.
+- **No synchronisation issues** & headaches involved in concurrent computing. Don't have to deal with race conditions, deadlocks, livelocks, etc.
\ No newline at end of file
diff --git a/src/chapter5/parallel-refresher.md b/src/chapter5/parallel-refresher.md
deleted file mode 100644
index c4dbc3c..0000000
--- a/src/chapter5/parallel-refresher.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# Refresher on Parallelism
-
-## Task Parallelism
-
-We saw in the last chapter parallel computing can be used to solve problems by executing code in parallel as opposed to in series.
-
-![Task parallelism](imgs/task_parallelism.jpg)
-
-## Data Parallelism
-
-Note that not all programs can be broken down into independent tasks and we might instead data parallelism like the following.
-
-![Data parallelism](imgs/data_parallelism.jpg)
-
-## Parallel computing example
-
-Think back to the example below which was provided in the last chapter. We will look at the cost of memory transactions soon.
-
-![Parallel computing example](imgs/parallel_computing_arrays_eg.png)
-
-## Parallel Scalability
-
-The speed up achieved from parallelism is dictated by your algorithm. Notably the serial bits of your algorithm can not be sped up by increasing the number of processors. The diagram below looks at the benefits we can achieve from writing parallel code as the number of processes increases.
-
-![Parallel scalability](imgs/parallel_scalability.jpg)
-
-## Memory Architectures
-
-Lastly, the different memory architectures we looked at in the last section included shared memory, distributed memory and hybrid architectures. We have looked at shared memory in detail and now we will dive into distributed memory architecture.
-
-![Memory architectures](imgs/memory_architectures.jpg)
diff --git a/src/chapter5/slurm_intro.md b/src/chapter5/slurm_intro.md
new file mode 100644
index 0000000..3dd7e16
--- /dev/null
+++ b/src/chapter5/slurm_intro.md
@@ -0,0 +1,153 @@
+# SLURM
+
+Slurm (Simple Linux Utility for Resource Management) is an open-source workload manager/scheduler used on the M3 MASSIVE cluster. It was originally developed at Lawrence Livermore National Laboratory in the early 2000s, is now maintained by SchedMD, and is used by approximately 65% of the world's supercomputers.
Slurm is basically the intermediary between the login node and the compute nodes. Hence, the Slurm scheduler is the gateway for users on the login node to submit work/jobs to the compute nodes for processing.
+
+Slurm has three key functions.
+1. It provides exclusive and/or non-exclusive access to the resources on the compute nodes to the users for a certain amount of time. Hence, the users can perform any computation with the resources.
+2. It provides a framework to start, execute, and check the work on the set of allocated compute nodes.
+3. It manages the queue of pending jobs based on the availability of resources.
+
+## Basic Linux Commands
+
+Since Slurm and the M3 nodes run on Linux, it's necessary to know some basic Linux commands. You will have to be comfortable using these commands, writing Bash scripts and navigating the Linux environment in order to be successful in a lot of our HPC projects.
+
+| Command | Function |
+| --- | --- |
+| `pwd` | prints current directory |
+| `ls` | prints list of files / directories in current directory (add a `-a` to list everything, including hidden files/directories) |
+| `mkdir` | makes a directory |
+| `rm <filename>` | deletes *filename*; add `-r` to delete a directory, add `-f` to force deletion (be really careful with that one) |
+| `cd <directory>` | move to *directory* |
+| `vim` or `nano` | bring up a text editor |
+| `cat <filename>` | prints contents of file to terminal |
+| `echo` | prints whatever you put after it |
+| `chmod <permissions> <file>` | changes permissions of file |
+| `cp` | copy a file or directory |
+| `mv <source> <destination>` | move or rename file or directory |
+
+> Note: `.` and `..` are special directories. `.` is the current directory, and `..` is the parent directory. These can be used when using any command that takes a directory as an argument. Similar to these, `~` is the home directory, and `/` is the root directory. For example, if you wanted to copy something from the parent directory to the home directory, you could do `cp ../<file> ~/`, without having to navigate anywhere.
+
+## Bash Scripts
+
+Bash is both a command line interface and a scripting language. Linux commands are generally run using Bash. Bash scripts are a series of commands that are executed in order. Bash scripts are useful for automating tasks that you do often, or for running a series of commands that you don't want to type out every time. In our case, Bash scripts are used for running jobs on M3.
+
+In terms of use, Bash can encapsulate any command you would normally run in the terminal into a script that can be easily reused. For example, you could have a script that automatically navigates to a directory and activates a virtual environment, or a script that compiles and runs a C program.
+
+The basic syntax of a bash file is as follows:
+
+```bash
+#!/bin/bash
+
+# This is a comment
+
+echo "Hello World"
+```
+
+We can save this file as `hello.sh` and run it using the following command: `source hello.sh`. This will print `Hello World` to the terminal.
+
+Let's walk through the file. The first line is `#!/bin/bash`. This is called a shebang, and it tells the system that this file is a bash script. The second line is a comment, and is ignored by the system. The third line is the actual command that we want to run. In this case, we are using the `echo` command to print `Hello World` to the terminal.
+
+Bash can do a lot more, including basic arithmetic, if statements, loops, and functions, however these are not really necessary for what we are doing.
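For a quick taste of those constructs anyway, here is a minimal sketch (a hypothetical `count.sh`, not something the M3 docs provide):

```bash
#!/bin/bash

# A variable and some arithmetic
n=3
double=$((n * 2))

# An if statement
if [ "$n" -gt 0 ]; then
    echo "n is positive and its double is $double"
fi

# A loop that counts from 1 to n
for i in $(seq 1 "$n"); do
    echo "iteration $i"
done
```

It runs the same way as before: `source count.sh`.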
If you want to learn more about bash, you can find a good tutorial [here](https://linuxconfig.org/bash-scripting-tutorial).
+
+For our use, the main things we need to be able to do are to run executables and files. We do this the exact same way as if manually running them in the terminal. For example, if we want to run a python script, we can do the following:
+
+```bash
+#!/bin/bash
+
+# This will run hello.py using the python3 executable
+python3 hello.py
+```
+
+If we want to compile and then run a C program, we can do the following:
+
+```bash
+#!/bin/bash
+
+# This will compile hello.c and then run it
+gcc hello.c -o hello
+./hello
+```
+
+Using bash scripts not only saves a lot of time and effort, but it also makes it easier to run jobs on M3 using SLURM. We will go over how to do this in the next section.
+
+## Slurm Architecture
+Slurm has a centralized manager, slurmctld, to monitor resources and work. Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work. There is an optional slurmdbd (Slurm DataBase Daemon) which can be used to record job accounting information in a database.
+
+![slurm-arch](./imgs/slurm-arch.gif)
+
+## Basic Slurm Commands
+Slurm provides a variety of tools for users to submit and manage jobs along with viewing info about them. These commands can be used interactively (in a terminal) on the login node.
+
+| Commands | Syntax | Description |
+| --- | --- | --- |
+| `sinfo` | `sinfo` | Get information about the resources on available nodes that make up the HPC cluster. |
+| `sbatch` | `sbatch <job script>` | Submit a batch script to Slurm for processing. |
+| `srun` | `srun <command>` | Run jobs interactively on the cluster. |
+| `skill/scancel` | `scancel <job-id>` | End or cancel a queued job. |
+| `squeue` | `squeue -u <username>` | Show information about your job(s) in the queue. When run without the `-u` flag, the command shows a list of your job(s) and all other jobs in the queue. |
+| `sacct` | `sacct` | Show information about current and previous jobs. |
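For instance, the first two commands can be tried directly from the login node; a sketch only, since the actual output depends on the cluster's state:

```bash
sinfo           # list partitions and the state of their nodes
srun hostname   # run a single command interactively on a compute node
```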
+## Slurm Job Scripting
+Slurm job scripts are very similar to bash scripts, but you will have to use a set of Slurm-specific directives (flags beginning with `#`) as well as Slurm-specific & Bash commands.
+
+In creating a Slurm script, there are **4 main parts** that are mandatory in order for your job to be successfully processed.
+
+1. **Shebang:** The Shebang command tells the shell (which interprets the UNIX commands) to interpret and run the Slurm script using the bash (Bourne-again shell) shell.
+
+> This line should always be added at the very top of your SBATCH/Slurm script. (Same as all Bash scripts)
+
+2. **Resource Request:** In this section, the amount of resources required for the job to run on the compute nodes is specified. This informs Slurm about the name of the job, output filename, amount of RAM, number of CPUs, nodes, tasks, time, and other parameters to be used for processing the job.
+
+> These SBATCH commands are also known as SBATCH directives and must be preceded with a pound sign and should be in an uppercase format as shown below.
+
+```
+#SBATCH --job-name=TestJob
+#SBATCH --output=TestJob.out
+#SBATCH --time=1-00:10:00
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=1
+#SBATCH --mem-per-cpu=500M
+```
+
+3. **Dependencies:** Load all the software that the project depends on to execute. For example, if you are working on a python project, you'd definitely require the python software or module to interpret and run your code. Go to Chapter 5.6 for more info on this.
+
+```
+module load python
+```
+
+4. **Job Steps:** Specify the list of tasks to be carried out.
+
+```
+srun echo "Start process"
+srun hostname
+srun sleep 30
+srun echo "End process"
+```
+
+### Putting it all together
+Please note that the lines with the double pound signs (##) are comments when used in batch scripts.
+
+```
+## Shebang
+#!/bin/bash
+
+## Resource Request
+#SBATCH --job-name=TestJob
+#SBATCH --output=TestJob.out
+#SBATCH --time=1-00:10:00
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=1
+#SBATCH --mem-per-cpu=500M
+
+## Job Steps
+srun echo "Start process"
+srun hostname
+srun sleep 30
+srun echo "End process"
+```
+
+In the script above, 1 node with 1 CPU, 500MB of memory per CPU and 10 minutes of walltime was requested for the tasks (job steps). Note that all the job steps that begin with the srun command will execute sequentially, as one task by one CPU only.
+
+The first job step will run the Linux echo command and output Start process. The next job step (2) will echo the hostname of the compute node that executed the job. Then, the next job step will execute the Linux sleep command for 30 seconds. The final job step will just echo out End process. Note that these job steps execute sequentially and not in parallel.
+
+It's important to set a limit on the total run time of the job allocation; this helps the Slurm manager to handle prioritization and queuing efficiently. The above example is a very simple script which takes less than a second. Hence, it's important to specify the run-time limit so that Slurm doesn't see the job as one that requires a lot of time to execute.
\ No newline at end of file
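Assuming the script above is saved as `test_job.slurm` (a hypothetical filename), a typical submit-and-monitor cycle using the commands from the table above would be:

```bash
sbatch test_job.slurm   # submit; Slurm prints the new job's id
squeue -u <username>    # is the job pending or running?
sacct                   # after completion, view its accounting record
cat TestJob.out         # read the output file set by --output
```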
diff --git a/src/chapter3/software-tooling.md b/src/chapter5/software-tooling.md
similarity index 97%
rename from src/chapter3/software-tooling.md
rename to src/chapter5/software-tooling.md
index a3ec5eb..66550f8 100644
--- a/src/chapter3/software-tooling.md
+++ b/src/chapter5/software-tooling.md
@@ -32,9 +32,9 @@ Python is a bit of a special case on M3. This is because of how many different v

These instructions are based off the M3 docs, which can be found [here](https://docs.massive.org.au/M3/software/pythonandconda/pythonandconda.html#pythonandconda).

-### Miniconda
+## Miniconda

-#### Installing Miniconda
+### Installing Miniconda

To install Miniconda on M3, there is a dedicated install script that you can use. This will install miniconda into your default scratch space, i.e. `/vf38_scratch/<username>/miniconda3`. To install miniconda, run the following command:

@@ -48,7 +48,7 @@ conda-install
conda-install your/install/location
```

-#### Activating Miniconda
+### Activating Miniconda

To activate the base conda environment, run the following command:

@@ -69,7 +69,7 @@ conda create --name env-name python=<version>
conda activate env-name
```

-#### Managing Python packages
+### Managing Python packages

Use the following commands to install and manage Python packages:

@@ -86,11 +86,11 @@ conda update package-name
conda remove package-name
```

-#### Deactivating Miniconda
+### Deactivating Miniconda

To deactivate the conda environment you are in, run `conda deactivate`. To exit conda entirely run `conda deactivate` again. You will know you have fully exited conda when `(base)` is no longer in the prompt.

-### VIM
+## VIM

VIM is a terminal based text editor. You may have heard about it, or even tried using it before. If so, you might recognise the common meme of "how do I exit VIM???". This is because VIM uses a very different key binding system to other text editors, and it can be a little confusing to get used to. However, once you get used to it, it is actually a very powerful and efficient text editor.
diff --git a/src/chapter6/challenges.md b/src/chapter6/challenges.md
new file mode 100644
index 0000000..0d9c4ce
--- /dev/null
+++ b/src/chapter6/challenges.md
@@ -0,0 +1,40 @@
+# Parallel Computing Challenges
+
+## Pre-Tasks
+
+Make sure to clone a copy of **your** challenges repo onto M3, ideally in a personal folder on vf38_scratch.
+
+> Note: For every challenge you will be running the programs as SLURM jobs. This is so we don't overload the login nodes. A template [SLURM job script](./job.slurm) is provided at the root of this directory which you can use to submit your own jobs to SLURM by copying it to each challenge's sub-directory and filling in the missing details. You may need more than one for some challenges. This template will put the would-be-printed output in a file named `slurm-<job-id>.out`.
+
+## Task 1 - Single Cluster Job using OpenMP
+
+Create a program in `hello.c` that prints 'Hello, world from thread: <thread-id>' to the output. Launch the job to a node using SLURM. Next, extend the program to run on multiple nodes using OpenMPI.
+
+> Note:
+>
+> - The output of a job is put in a `slurm-<job-id>.out` file by default.
+> - The template slurm job scripts will output the results to a `slurm-<job-id>.out` file.
+
+## Task 2 - Parallel `for` Loop
+
+In `array-gen.c` implement a program that generates an array containing the numbers 0..10,000 (inclusive) using a `for` loop. Measure the execution time using the `time` Linux command. Now reimplement the program to utilise OpenMP's parallel `for` loop macros, measuring the execution time again. Is there any performance improvement? Are the elements still in the correct order, and if not, how can you fix this? Try experimenting with different sized arrays and element types. Again, extend the program to use multiple nodes.
+
+> Hint: You will likely need to allocate memory from the heap.
+
+## Task 3 - Parallel Reductions
+
+In the C chapter we created a sum program that summed the elements of an array together. Using this as a base, create a new program that again computes the sum of the elements of an array but using OpenMP, comparing the execution time between the sequential and parallel versions. Is there any performance improvement? How would using a different binary operator change our ability to parallelize the reduction?
+
+If you have time, implement the sum but at each iteration, raise the current value to the power of the current accumulation divided by 100, adding this to the accumulation. Test a serial and parallel version. Is the parallel version any faster?
+
+> Note: `module load gcc` to use a newer version of gcc if you get an error with something like `-std=c99`.
+
+## Task 4 - Laplace Equation for Calculating the Temperature of a Square Plane
+
+For this challenge you will attempt to parallelize an existing implementation of the Laplace Equation. Throughout the source files of this project there are various loops you can try and make faster by utilizing OpenMP macros. See if you can make a faster version in the `laplace2d-parallel.c`. To build these files make sure you're in that directory and use the command `make`. The executables will be in the same directory.
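Before Task 5, here is a minimal sketch of the OpenMP pattern that Tasks 2 and 3 build on: a parallel `for` loop plus a `reduction` clause. It is illustrative only (the array size and types are assumptions), and compiles with `gcc -fopenmp`:

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n = 10000;
    int* arr = malloc(n * sizeof(int));
    long sum = 0;

    // Each thread fills its own chunk of the array.
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        arr[i] = i;
    }

    // reduction(+:sum) gives every thread a private sum and
    // combines them once the loop ends.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += arr[i];
    }

    printf("sum = %ld\n", sum);
    free(arr);
    return 0;
}
```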
+
+## Task 5 - Calculate Pi using "Monte Carlo Algorithm"
+
+For this challenge you will have to try and implement the Monte Carlo algorithm with no framework or template, using everything you've learnt so far. Good luck.
+
+[Short explanation of Monte Carlo algorithm](https://www.youtube.com/watch?v=7ESK5SaP-bc&ab_channel=MarbleScience)
diff --git a/src/chapter6/chapter6.md b/src/chapter6/chapter6.md
new file mode 100644
index 0000000..95c1d02
--- /dev/null
+++ b/src/chapter6/chapter6.md
@@ -0,0 +1,7 @@
+# Parallel Computing
+
+In this chapter, we will discuss the abstraction of parallel computing. To facilitate our exploration, we will employ an API within the C Programming Language: OpenMP. This tool will serve as a means to concretely illustrate the underlying language-independent theory.
+
+**Parallel computing is about executing the instructions of the program simultaneously.**
+
+One of the core values of computing is the breaking down of a big problem into smaller, easier-to-solve problems, or at least smaller problems. In some cases, the steps required to solve the problem can be executed simultaneously (in parallel) rather than sequentially (in order).
diff --git a/src/chapter6/imgs/barrier-end.png b/src/chapter6/imgs/barrier-end.png
new file mode 100755
index 0000000..e4b54a7
Binary files /dev/null and b/src/chapter6/imgs/barrier-end.png differ
diff --git a/src/chapter6/imgs/barrier-wait.png b/src/chapter6/imgs/barrier-wait.png
new file mode 100755
index 0000000..fd3f02c
Binary files /dev/null and b/src/chapter6/imgs/barrier-wait.png differ
diff --git a/src/chapter6/imgs/barrier.png b/src/chapter6/imgs/barrier.png
new file mode 100755
index 0000000..10b3c87
Binary files /dev/null and b/src/chapter6/imgs/barrier.png differ
diff --git a/src/chapter6/imgs/deadlock.png b/src/chapter6/imgs/deadlock.png
new file mode 100644
index 0000000..6a8bd30
Binary files /dev/null and b/src/chapter6/imgs/deadlock.png differ
diff --git a/src/chapter6/imgs/explicit-barrier.png b/src/chapter6/imgs/explicit-barrier.png
new file mode 100755
index 0000000..0fb836a
Binary files /dev/null and b/src/chapter6/imgs/explicit-barrier.png differ
diff --git a/src/chapter4/imgs/Threads Visualisation.png b/src/chapter6/imgs/fork-join.png
similarity index 100%
rename from src/chapter4/imgs/Threads Visualisation.png
rename to src/chapter6/imgs/fork-join.png
diff --git a/src/chapter6/imgs/htop.png b/src/chapter6/imgs/htop.png
new file mode 100644
index 0000000..cbc1fd3
Binary files /dev/null and b/src/chapter6/imgs/htop.png differ
diff --git a/src/chapter6/imgs/mpi-routines.png b/src/chapter6/imgs/mpi-routines.png
new file mode 100644
index 0000000..20768d6
Binary files /dev/null and b/src/chapter6/imgs/mpi-routines.png differ
diff --git a/src/chapter6/imgs/one-thread-counter.png b/src/chapter6/imgs/one-thread-counter.png
new file mode 100644
index 0000000..0386ee1
Binary files /dev/null and b/src/chapter6/imgs/one-thread-counter.png differ
diff --git a/src/chapter4/imgs/OpenMP and Directive.png b/src/chapter6/imgs/program-structure.png
similarity index 100%
rename from src/chapter4/imgs/OpenMP and Directive.png
rename to src/chapter6/imgs/program-structure.png
diff --git a/src/chapter6/imgs/time.png b/src/chapter6/imgs/time.png
new file mode 100644
index 0000000..b9f5185
Binary files /dev/null and b/src/chapter6/imgs/time.png differ
diff --git a/src/chapter6/imgs/two-threads-counter.png b/src/chapter6/imgs/two-threads-counter.png
new file mode 100644
index 0000000..e83f3fd
Binary files /dev/null and
b/src/chapter6/imgs/two-threads-counter.png differ
diff --git a/src/chapter6/locks.md b/src/chapter6/locks.md
new file mode 100644
index 0000000..78052d6
--- /dev/null
+++ b/src/chapter6/locks.md
@@ -0,0 +1,199 @@
+# Locks
+
+Earlier, we learnt how to write concurrent programs, as well as a few constructs for achieving **synchronisation** in OpenMP. We know that:
+- the `reduction construct` partitions shared data and uses a barrier to achieve synchronisation
+- the `atomic construct` utilises the hardware's ability to achieve thread-safe small memory read/write operations.
+
+What about the `critical construct`? We said that it uses locks, but what are locks?
+
+> Note that the direct use of locks is **not recommended** (at least in OpenMP):
+> - It is very easy to cause deadlock or hard-to-debug livelock (more on these at the end of this sub-chapter).
+> - It can often cause very poor performance or worse.
+> - It generally indicates that the program design is wrong.
+>
+> We will explore them because it is important to know what is happening under the hood of the high-level APIs.
+
+## Overall Idea
+
+A lock is a synchronisation technique: an abstraction that allows at most one thread to own it at a time. To be more concrete, let's say we have a segment of code guarded by a **lock**. Then, exactly one thread can execute those lines of code at a time (sound familiar?). Any other thread (without the lock) trying to enter the code segment will have to wait until the lock is released.
+
+## OpenMP Locks
+
+Let's start with an example:
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+int total = 0;
+int n = 100;
+int nums[100];
+omp_lock_t lock; // uninitialized
+
+int main() {
+
+    omp_init_lock(&lock); // the lock is initialized but unlocked
+
+    // Populate nums
+    for (int i = 0; i < n; i++) {
+        nums[i] = i;
+    }
+
+#pragma omp parallel for
+    for (int i = 0; i < n; i++) {
+        int temp = nums[i];
+
+        omp_set_lock(&lock); // a thread changes the lock's state to locked
+
+        total += temp; // something that we want only 1 thread to execute at a time
+
+        omp_unset_lock(&lock); // the thread owning the lock changes the lock's state to unlocked
+    }
+    omp_destroy_lock(&lock);
+    printf("%d\n", total);
+}
+```
+
+An OpenMP lock can exist in three states: **uninitialized**, **unlocked**, or **locked**. When in the unlocked state, a task can acquire the lock, transitioning it to the locked state. The task acquiring the lock is considered its owner. An owning task can release the lock, reverting it to the unlocked state. Any attempt by a task to release a lock it does not own renders the program non-conforming.
+
+Two types of locks are supported: simple locks and nested locks.
+- Nested locks allow for multiple acquisitions before unlocking. They remain locked until unset as many times as `omp_set_nest_lock` has been called. Nested locks facilitate scenarios where functions call other functions utilizing the same lock.
+- Simple locks should be acquired only once using `omp_set_lock` and released with a single call to `omp_unset_lock`.
+
+## Deadlocks
+
+When used correctly and cautiously, locks can effectively prevent race conditions. However, there's another issue to be aware of. Because using locks means that threads have to wait (blocking when another thread holds the lock), there's a risk of a situation where two threads end up waiting for each other, leading to a stalemate where neither can progress.
+
+Let's look at this code:
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+omp_lock_t lock1, lock2;
+
+int main() {
+    omp_init_lock(&lock1);
+    omp_init_lock(&lock2);
+
+#pragma omp parallel num_threads(2)
+    {
+        int thread_id = omp_get_thread_num();
+
+        if (thread_id == 0) {
+            omp_set_lock(&lock1); // Thread 0 takes lock 1
+            printf("Thread %d acquired lock1\n", thread_id);
+            omp_set_lock(&lock2); // Attempt to take lock 2 (but it already belongs to thread 1 => wait)
+            printf("Thread %d acquired lock2\n", thread_id);
+            omp_unset_lock(&lock2);
+            omp_unset_lock(&lock1);
+        }
+        else {
+            omp_set_lock(&lock2); // Thread 1 takes lock 2
+            printf("Thread %d acquired lock2\n", thread_id);
+            omp_set_lock(&lock1); // Attempt to take lock 1 (but it already belongs to thread 0 => wait)
+            printf("Thread %d acquired lock1\n", thread_id);
+            omp_unset_lock(&lock1);
+            omp_unset_lock(&lock2);
+        }
+    }
+
+    omp_destroy_lock(&lock1);
+    omp_destroy_lock(&lock2);
+
+    return 0;
+}
+```
+
+The output should be something like this:
+
+![Deadlock](./imgs/deadlock.png)
+
+The program does not terminate, yet no thread is making any progress: they are blocking each other at the same time!
+
+Deadlock is not limited to just two threads; the key characteristic of deadlock is a cycle of dependencies:
+- A is waiting for B
+- B is waiting for C
+- C is waiting for A
+
+In such a loop, none of the threads can move forward.
+
+## Livelocks
+
+A more challenging issue that may arise is livelock. Similar to deadlock, livelocked threads are unable to make progress. However, unlike deadlock, where threads are blocked, livelocked threads remain active. They're caught in a continuous and infinite sequence of responding to each other, preventing them from making any meaningful progress in their work.
+
+```c
+#include <stdio.h>
+#include <unistd.h>
+#include <omp.h>
+
+omp_lock_t lock1, lock2;
+
+void execute_task(int task_number) {
+    omp_lock_t* first_lock;
+    omp_lock_t* second_lock;
+    const char* lock1_message;
+    const char* lock2_message;
+
+    if (task_number == 1) {
+        first_lock = &lock1;
+        second_lock = &lock2;
+        lock1_message = "lock1";
+        lock2_message = "lock2";
+    }
+    else {
+        first_lock = &lock2;
+        second_lock = &lock1;
+        lock1_message = "lock2";
+        lock2_message = "lock1";
+    }
+
+    while (1) {
+        omp_set_lock(first_lock);
+        printf("%s acquired, trying to acquire %s.\n", lock1_message, lock2_message);
+
+        // sleep for 50 milliseconds to simulate some meaningful work,
+        // and to ensure that the lock/unlock ordering cannot correct itself by chance
+        usleep(50000);
+
+        if (omp_test_lock(second_lock)) {
+            printf("%s acquired.\n", lock2_message);
+        }
+        else {
+            printf("cannot acquire %s, releasing %s.\n", lock2_message, lock1_message);
+            omp_unset_lock(first_lock);
+            continue;
+        }
+
+        printf("executing task %d.\n", task_number);
+        break;
+    }
+    omp_unset_lock(second_lock);
+    omp_unset_lock(first_lock);
+}
+
+int main() {
+    omp_init_lock(&lock1);
+    omp_init_lock(&lock2);
+
+// each section will be executed in parallel
+#pragma omp parallel sections
+    {
+#pragma omp section
+        {
+            execute_task(1);
+        }
+
+#pragma omp section
+        {
+            execute_task(2);
+        }
+    }
+
+    omp_destroy_lock(&lock1);
+    omp_destroy_lock(&lock2);
+
+    return 0;
+}
+
+```
diff --git a/src/chapter5/openmpi.md b/src/chapter6/message-passing.md
similarity index 76%
rename from src/chapter5/openmpi.md
rename to src/chapter6/message-passing.md
index 6e89875..d893137 100644
--- a/src/chapter5/openmpi.md
+++ b/src/chapter6/message-passing.md
@@ -1,6 +1,16 @@
-# OpenMPI
-## Primary MPI Routines
+
+In distributed computing, each processor has its own local memory with its own address space, so we need a way for processes to communicate and share data. Message passing is the mechanism of exchanging data across processes. Each process can communicate with one or more other processes by sending messages over a network.
+
+MPI (Message Passing Interface) is a standardised communication protocol that defines message passing between processors in distributed environments. It is implemented by different groups, with the main goals being high performance, scalability, and portability.
+
+OpenMPI is one implementation of the MPI standard. It consists of a set of headers and library functions that you call from your program (in C, C++, Fortran, etc.).
+
+For C, you will need the header file `mpi.h` for all the functions, and you will need to link in the relevant library functions. This is all handled by the mpicc program (or your compiler, if you want to specify all the paths yourself).
+
+## OpenMPI
+
+### Primary MPI Routines
``` C
int MPI_Init(int * argc, char ** argv);
@@ -22,9 +32,9 @@ int MPI_Comm_rank(MPI_Comm comm, int \* rank);
// rank contains the value for that process- the function return value is an error code
```
-![MPI routines](imgs/mpi_routines.png)
+![MPI routines](imgs/mpi-routines.png)
-### Point-to-Point communication
+#### Point-to-Point communication
These are blocking functions - they wait until the message is sent or received. Note that the CPU is actively polling the network interface when waiting for a message. This is opposite in behaviour to other C functions, e.g. `c = getchar()` (which causes a context switch and then a sleep in the OS). This is done for speed reasons.
@@ -82,7 +92,7 @@ OUTPUT PARAMETERS
- ```IERROR``` - Fortran only: Error status (integer).
-### Primary MPI Routines closing
+#### Primary MPI Routines closing
In a header file you will find
@@ -97,7 +107,7 @@ To call in your C or C++ program
MPI_Finalize();
```
-## General overview MPI program
+### General overview MPI program
``` C
...
@@ -119,27 +129,27 @@ Use man pages to find out more about each routine
When sending, a process packs up all of its necessary data into a buffer for the receiving process. These buffers are often referred to as envelopes, since the data is being packed into a single message before transmission (similar to how letters are packed into envelopes before transmission to the post office)
-## Elementary MPI Data types
+### Elementary MPI Data types
MPI_Send and MPI_Recv utilize MPI Datatypes as a means to specify the structure of a message at a higher level. The data types defined in the table below are simple in nature; for custom data structures you will have to define the structure yourself.
-| MPI datatype | C equivalent |
-|-------------------------|------------------------|
-| MPI_SHORT | short int |
-| MPI_INT | int |
-| MPI_LONG | long int |
-| MPI_LONG_LONG | long long int |
-| MPI_UNSIGNED_CHAR | unsigned char |
-| MPI_UNSIGNED_SHORT | unsigned short int |
-| MPI_UNSIGNED | unsigned int |
-| MPI_UNSIGNED_LONG | unsigned long int |
-| MPI_UNSIGNED_LONG_LONG | unsigned long long int |
-| MPI_FLOAT | float |
-| MPI_DOUBLE | double |
-| MPI_LONG_DOUBLE | long double |
-| MPI_BYTE | char |
-
-## Example of a simple program
+| MPI datatype | C equivalent |
+| ---------------------- | ---------------------- |
+| MPI_SHORT | short int |
+| MPI_INT | int |
+| MPI_LONG | long int |
+| MPI_LONG_LONG | long long int |
+| MPI_UNSIGNED_CHAR | unsigned char |
+| MPI_UNSIGNED_SHORT | unsigned short int |
+| MPI_UNSIGNED | unsigned int |
+| MPI_UNSIGNED_LONG | unsigned long int |
+| MPI_UNSIGNED_LONG_LONG | unsigned long long int |
+| MPI_FLOAT | float |
+| MPI_DOUBLE | double |
+| MPI_LONG_DOUBLE | long double |
+| MPI_BYTE | char |
+
+### Example of a simple program
``` C
@@ -195,7 +205,7 @@ int main(int argc, char *argv[])
}
```
-## Compilation and Linking
+### Compilation and Linking
- Make sure you have the following packages installed and that they are in your $PATH:
- gcc
- OpenMPI
- To compile and link we will be using mpicc
@@ -210,9 +220,9 @@ int main(int argc, char *argv[])
- mpicc is just a wrapper around a C compiler. To see what it does type:
- ```mpicc --showme```
-### sbatch to send job to compute nodes using SLURM
+#### sbatch to send job to compute nodes using SLURM
-``` bash
+```bash
#!/bin/bash
#SBATCH --job-name=Vaccinator
#SBATCH --ntasks=4
@@ -231,7 +241,7 @@ mpirun -np 4 ./my-awesome-program
- ntasks-per-node Controls the maximum number of tasks per allocated node
- cpus-per-task Controls the number of CPUs allocated per task
-## Measuring performance
+### Measuring performance
- ```htop``` to check the CPU usage. You need to run this command while the process is running
- If you are using SLURM, you will need to use ```squeue``` or ```scontrol``` to find the compute node the job is running on and then ssh into it.
diff --git a/src/chapter6/multithreading.md b/src/chapter6/multithreading.md
new file mode 100644
index 0000000..34a8dd9
--- /dev/null
+++ b/src/chapter6/multithreading.md
@@ -0,0 +1,106 @@
+# Multithreading
+
+We have all looked at the theory of threads and concurrent programming in the Operating Systems chapter. Now, we will shift our focus to OpenMP and its application for executing multithreaded operations in a declarative programming style.
+
+## OpenMP
+
+OpenMP is an Application Program Interface (API) that is used to explicitly direct multi-threaded, shared-memory parallelism in C/C++ programs. It is not intrusive on the original serial code, in that the OpenMP instructions are written as pragmas interpreted by the compiler.
+
+> Further features of OpenMP will be introduced in conjunction with the concepts discussed in later sub-chapters.
+
+### Fork-Join Parallel Execution Model
+
+OpenMP uses the `fork-join model` of parallel execution.
+
+* **FORK**: All OpenMP programs begin with a `single master thread` which executes sequentially until a `parallel region` is encountered, at which point it creates a team of parallel threads.
+
+The OpenMP runtime library maintains a pool of threads that can be added to the thread team in parallel regions.
When a thread encounters a parallel construct and needs to create a team of more than one thread, it will check the pool and grab idle threads from it, making them part of the team.
+
+* **JOIN**: Once the team's threads complete the parallel region, they `synchronise` and return to the pool, leaving only the master thread, which executes sequentially.
+
+![Fork - Join Model](./imgs/fork-join.png)
+
+> We will look a bit more into what synchronisation is, as well as synchronisation techniques, in the next sub-chapter.
+
+### Imperative vs Declarative
+
+Imperative programming specifies and directs the control flow of the program. On the other hand, declarative programming specifies the expected result and core logic without directing the program's control flow.
+
+OpenMP follows a declarative programming style. Instead of manually creating, managing, synchronizing, and terminating threads, we can achieve the desired outcome by simply declaring it using pragmas.
+
+![Structure Overview](./imgs/program-structure.png)
+
+### Working with OpenMP
+
+We will now look at a simple example.
+
+> The code can be compiled with `gcc -fopenmp -o hello hello.c`.
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+int main() {
+    int i;
+    #pragma omp parallel for
+    for (i = 0; i < 10; i++) {
+        printf("Thread %d executing iteration %d\n", omp_get_thread_num(), i);
+    }
+    return 0;
+}
+```
+
+## Running on M3
+
+Here is a template script provided in the home directory on M3. Notice that we can dynamically change the number of threads using `export OMP_NUM_THREADS=12`.
+
+```bash
+#!/bin/bash
+# Usage: sbatch slurm-openmp-job-script
+# Prepared By: Kai Xi, Apr 2015
+# help@massive.org.au
+
+# NOTE: To activate a SLURM option, remove the whitespace between the '#' and 'SBATCH'
+
+# To give your job a name, replace "MyJob" with an appropriate name
+# SBATCH --job-name=MyJob
+
+
+# To set a project account for credit charging,
+# SBATCH --account=pmosp
+
+
+# Request CPU resource for an OpenMP job, suppose it is a 12-thread job
+# SBATCH --ntasks=1
+# SBATCH --ntasks-per-node=1
+# SBATCH --cpus-per-task=12
+
+# Memory usage (MB)
+# SBATCH --mem-per-cpu=4000
+
+# Set your minimum acceptable walltime, format: day-hours:minutes:seconds
+# SBATCH --time=0-06:00:00
+
+
+# To receive an email when job completes or fails
+# SBATCH --mail-user=
+# SBATCH --mail-type=END
+# SBATCH --mail-type=FAIL
+
+
+# Set the file for output (stdout)
+# SBATCH --output=MyJob-%j.out
+
+# Set the file for error log (stderr)
+# SBATCH --error=MyJob-%j.err
+
+
+# Use reserved node to run job when a node reservation is made for you already
+# SBATCH --reservation=reservation_name
+
+
+# Command to run an OpenMP job
+# Set OMP_NUM_THREADS to the same value as: --cpus-per-task=12
+export OMP_NUM_THREADS=12
+./your_openmp_program
+```
diff --git a/src/chapter6/synchronisation.md b/src/chapter6/synchronisation.md
new file mode 100644
index 0000000..132f56f
--- /dev/null
+++ b/src/chapter6/synchronisation.md
@@ -0,0 +1,257 @@
+# Synchronisation
+
+Definition: Synchronisation is the task of coordinating multiple processes (or threads) to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of actions.
+
+## Race Condition
+
+Let's start with this simple program:
+
+```c
+/*
+We purposefully added the following code within the program:
+- The usleep() calls allow thread switching in the middle of function calls.
+- The silly variable assignments in increment() mimic the register.
+- All functions are sharing a global counter variable.
+
+Note that:
+- Even if we remove all of the usleep() calls and the variable assignments,
+the error can still occur by chance.
+
+What should be the desired output?
+What is the actual output?
+*/
+#include <stdio.h>
+#include <unistd.h>
+#include <omp.h>
+
+unsigned int sleep_time = 100000; // 0.1 seconds, in microseconds
+int counter = 0; // Shared across the program
+
+int get_value() {
+    usleep(sleep_time); // This will cause thread switching
+    printf("Current Counter = %d\n", counter);
+    return counter;
+}
+
+void increment() {
+    int temp = counter; // Load counter into a register
+    usleep(sleep_time); // This will cause thread switching
+    temp++;             // Increment the register
+    counter = temp;     // Store back to the variable
+
+    printf("Incremented counter to %d\n", counter);
+}
+
+int main() {
+#pragma omp parallel for
+    for (int i = 0; i < 5; i++) {
+        increment();
+        get_value();
+    }
+
+    return 0;
+}
+```
+
+### Single Thread
+
+Running the program using 1 thread:
+```bash
+export OMP_NUM_THREADS=1
+./counter
+```
+The output should look something like this:
+
+![1 thread counter](./imgs/one-thread-counter.png)
+
+The output matches what we expected.
+- This is because we only used a single thread.
+- The program is just a sequential program without any parallelism.
+- The `usleep()` calls simply put the thread to sleep; that same thread wakes up and continues the execution.
+
+### Multiple Threads
+
+```bash
+export OMP_NUM_THREADS=2
+./counter
+```
+
+Running the program using 2 threads may give us this output (this is just 1 **possible** output):
+
+![2 threads counter](./imgs/two-threads-counter.png)
+
+What is happening here?
+- We are using 2 threads.
+- Both threads are trying to access the global variable `counter` at (roughly) the same time.
+- While one thread is sleeping, the other thread may increment the shared counter.
+- The 2 threads simply go on their way and do not coordinate with each other.
+
+> What we have here is a `race condition`. A race condition occurs when two or more threads can access `shared data` and they try to `change it at the same time`.
+
+### How to resolve the problem?
+
+There are a few ways we can resolve the race condition in OpenMP:
+
+* **Critical construct**: This restricts the code so that only one thread can do something at a time (in our example, only 1 thread can increment the counter at a time). However, it is `bad for performance` and can possibly destroy a lot of the gains from running code in parallel in the first place.
+
+```c
+int main() {
+#pragma omp parallel for
+    for (int i = 0; i < 5; i++) {
+#pragma omp critical // Critical construct
+        increment();
+        get_value();
+    }
+    return 0;
+}
+```
+
+* **Atomic construct**: This is quite similar to the critical construct; however, it only applies to memory read/write operations. It has better performance than the critical construct by taking advantage of the hardware. There's no lock/unlock needed on entering/exiting the line of code; it just performs the atomic operation, which the hardware guarantees can't be interfered with. Let's look at another example:
+
+> Run this program multiple times using multiple threads (before uncommenting the construct). Again, race condition!
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+int total = 0;
+int n = 100;
+int nums[100];
+
+int main() {
+    // Populate nums
+    for (int i = 0; i < n; i++) {
+        nums[i] = i;
+    }
+
+#pragma omp parallel for
+    for (int i = 0; i < n; i++) {
+        int temp = nums[i];
+        /*
+        We can easily resolve the race condition with the atomic/critical construct.
+        The atomic one will work perfectly and give better performance this time.
+        Uncomment the construct below to resolve the race condition.
+        */
+// #pragma omp atomic
+        total += temp;
+    }
+    printf("%d\n", total);
+}
+```
+
+* **Reduction**: Depending on the problem, sometimes the best solution is to use `reduction`. Let's analyse what this code is doing:
+
+> Using `reduction` here results in significantly better performance.
+> - A quick way to do some simple benchmarking is: `time a-command`
+> - Conduct benchmarking for the 3 versions, and try different numbers of threads
+
+Example:
+```bash
+# Tuning the number of threads
+export OMP_NUM_THREADS=4
+
+# Change according to your file's name
+time ./critical
+time ./atomic
+time ./reduction
+```
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+int total = 0;
+int n = 100;
+int nums[100];
+
+int main() {
+    // Populate nums
+    for (int i = 0; i < n; i++) {
+        nums[i] = i;
+    }
+
+#pragma omp parallel for reduction(+:total) num_threads(3)
+    for (int i = 0; i < n; i++) {
+        int temp = nums[i];
+        total += temp;
+    }
+    printf("Final total is: %d\n", total);
+}
+```
+
+> Notice that:
+> - The previous two approaches only allow 1 thread at a time to perform some operations.
+> - Reduction allows threads to access the same shared data at the same time, but in different parts of the data.
+>
+> The nature of the word **synchronisation** in these two examples is completely different, while still adhering to our initial definition!
+
+## Barrier Synchronisation
+
+In the last sub-chapter, we talked about the [Fork - Join Model](./multithreading.md#fork-join-parallel-execution-model). We know that **"Once the team's threads complete the parallel region, they `synchronise` and return to the pool, leaving only the master thread, which executes sequentially."** However, there are a few important aspects that we have left out:
+- The time taken to finish the assigned task is **different** for each thread.
+- How can OpenMP know/identify **when** a thread has completed its own task?
+- How can OpenMP know/identify **when** all threads have finished all the tasks?
+
+The answer lies in something called **Barrier Synchronisation**. Here are illustrations of the idea:
+
+![Barrier Illustration](./imgs/barrier.png)
+
+![Barrier Wait](./imgs/barrier-wait.png)
+
+![Barrier End](./imgs/barrier-end.png)
+
+### Implicit Barriers
+
+Barrier synchronisation implicitly (behind the scenes) occurs at the end of constructs such as the parallel construct (`#pragma omp parallel`) and at the end of worksharing constructs (loop, sections, single, and workshare constructs).
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+int main(void)
+{
+    #pragma omp parallel
+    {
+        // Parallel code
+        printf("Thread %d is executing...\n", omp_get_thread_num());
+    }
+
+    // Sequential code after the implicit barrier
+    printf("Main thread\n");
+    return 0;
+}
+```
+
+### Barrier Construct
+
+The barrier construct specifies an **explicit** barrier (we add the construct into the code ourselves) at the point at which the construct appears. The barrier construct is a stand-alone directive. Here is an illustration of the following code.
+
+![Explicit Barrier](./imgs/explicit-barrier.png)
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+int main(void)
+{
+    #pragma omp parallel
+    {
+        printf("Thread %d executes part 1\n", omp_get_thread_num());
+        #pragma omp barrier
+
+        // No thread will execute part 2 before part 1
+        printf("Thread %d executes part 2\n", omp_get_thread_num());
+    }
+    return 0;
+}
+```
+
+### Let's think about a way to implement a barrier
+
+We don't need to know exactly how OpenMP implements this feature, at least not right now (if you are interested in the OpenMP implementation, [here](https://www.openmp.org/spec-html/5.0/openmpse25.html) could be a start). We can follow a rough, simple approach:
+
+- Let's assume we have `n` threads.
+- We need a way to count how many threads have finished; this can easily be done with a shared counter variable (be careful with race conditions) among threads. When this counter reaches `n`, we will know that all threads have finished.
+- We also need a mechanism to make a finished thread idle and **wait()** for the other threads to finish.
+- The last thread to finish has the responsibility to **notify()** the other threads (the threads that you want to be executed after the barrier).
+
+Voila! We have a barrier. We will implement a barrier as part of a mini-project using [POSIX threads](https://docs.oracle.com/cd/E26502_01/html/E35303/tlib-1.html).
diff --git a/src/chapter7/challenges.md b/src/chapter7/challenges.md
new file mode 100644
index 0000000..a31dd95
--- /dev/null
+++ b/src/chapter7/challenges.md
@@ -0,0 +1,3 @@
+# Challenges
+
+![under-const](../imgs/under-const.gif)
\ No newline at end of file
diff --git a/src/chapter7/chapter7.md b/src/chapter7/chapter7.md
new file mode 100644
index 0000000..25d3de6
--- /dev/null
+++ b/src/chapter7/chapter7.md
@@ -0,0 +1,3 @@
+# Parallelisation of Algorithms
+
+![under-const](../imgs/under-const.gif)
\ No newline at end of file
diff --git a/src/chapter7/imgs/Beale_contour.svg.png b/src/chapter7/imgs/Beale_contour.svg.png
new file mode 100644
index 0000000..b3dc2b1
Binary files /dev/null and b/src/chapter7/imgs/Beale_contour.svg.png differ
diff --git a/src/chapter7/imgs/Rastrigin_contour_plot.svg.png b/src/chapter7/imgs/Rastrigin_contour_plot.svg.png
new file mode 100644
index 0000000..27ea052
Binary files /dev/null and b/src/chapter7/imgs/Rastrigin_contour_plot.svg.png differ
diff --git a/src/chapter7/imgs/Rosenbrock_contour.svg.png b/src/chapter7/imgs/Rosenbrock_contour.svg.png
new file mode 100644
index 0000000..6a13bef
Binary files /dev/null and b/src/chapter7/imgs/Rosenbrock_contour.svg.png differ
diff --git a/src/chapter7/machine-learning-and-hpc.md b/src/chapter7/machine-learning-and-hpc.md
new file mode 100644
index 0000000..646339c
--- /dev/null
+++ b/src/chapter7/machine-learning-and-hpc.md
@@ -0,0 +1,18 @@
+# Machine Learning & HPC
+
+## What is Machine Learning?
+
+Machine learning is a subset of artificial intelligence that involves developing models that enable computers to learn from, and make predictions or decisions based on, vast amounts of data. It involves training a model on large datasets to identify patterns and insights, so that it can automatically generalise and predict results from new, unseen data. The training datasets contain inputs paired with pre-determined outputs, allowing the ML model to pick up the patterns that link them.
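+
+To make "inputs paired with pre-determined outputs" concrete, here is a tiny, hypothetical example (every value is invented purely for illustration):
+
+```python
+# A toy supervised-learning dataset: each input (floor area in m^2, number of
+# rooms) is paired with a pre-determined output (sale price). During training,
+# the model's job is to discover the pattern linking inputs to outputs, so it
+# can later predict prices for houses it has never seen.
+training_inputs = [(50.0, 2), (80.0, 3), (120.0, 4)]
+training_outputs = [300_000, 450_000, 640_000]
+```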
+
+## How an ML model is trained
+
+During the training phase, an ML algorithm uses the training dataset to predict the output, and then compares the predicted output to the actual output to measure the disparity and compute the loss. Next, the gradient of the loss with respect to the model parameters is calculated and used to update the model parameters via gradient descent. This process is repeated a set number of times, or until the loss drops below a certain threshold.
+Since ML algorithms often deal with massive training datasets and complex models, training usually takes a very long time and requires significant computational resources. HPC provides a way to speed this process up by distributing the dataset or the model across different nodes/GPUs. There are two main forms of parallelism: Data Parallelism and Model Parallelism.
+
+## Data Parallelism
+
+Data parallelism involves dividing and distributing the training dataset across multiple nodes/GPUs. Each node receives a copy of the ML model, and they individually process their portion of the data to compute the gradient. The gradients are then averaged, and the model parameters on all the nodes are updated accordingly. Data parallelism is more suited to instances where the training dataset is massive and the individual samples can be processed independently, as with CNNs.
+
+## Model Parallelism
+
+Model parallelism involves partitioning the layers of the model across nodes/GPUs, allowing different parts of the model to process different batches of data simultaneously. Each node receives a layer of the ML model, and these nodes are aligned so that the output of one node is the input to another. The nodes individually process different batches of data simultaneously, then pass the data and activations on to the next node. The same batch of data is passed from one node to another, with the next batch following right after. Model parallelism is more suited to models with large memory requirements or complex architectures, such as large NLP models.
diff --git a/src/chapter7/optim-algos.md b/src/chapter7/optim-algos.md
new file mode 100644
index 0000000..4308f82
--- /dev/null
+++ b/src/chapter7/optim-algos.md
@@ -0,0 +1,3 @@
+# Optimisation Algorithms
+
+![under-const](../imgs/under-const.gif)
\ No newline at end of file
diff --git a/src/chapter7/optimisation-algorithms.md b/src/chapter7/optimisation-algorithms.md
new file mode 100644
index 0000000..85cba04
--- /dev/null
+++ b/src/chapter7/optimisation-algorithms.md
@@ -0,0 +1,103 @@
+# Optimisation Algorithms
+
+Optimisation algorithms are a family of algorithms which aim to find the best combination of some variables,
+to maximise (or minimise) some kind of output.
+Though that may sound quite abstract,
+optimisation algorithms can in fact be applied to a variety of different practical problems.
+Some examples where optimisation algorithms could be used include:
+
+- A bakery trying to figure out which ingredients to buy to maximise profit
+- A small business owner attempting to create the 'optimal' shift for their workers
+- A deep learning engineer trying to find the best parameters for their artificial intelligence model
+
+Though at surface level these problems may appear quite different,
+each of these situations can be modelled mathematically in similar ways,
+allowing us to use techniques from linear algebra to help us out.
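+
+As a concrete (toy) sketch of that idea, here is how the bakery example might be phrased as "choose a vector of ingredient amounts to maximise a profit function".
+All of the numbers here are invented purely for illustration:
+
+```python
+import numpy as np
+
+# Hypothetical figures: profit and cost per unit of three ingredients,
+# plus a total purchasing budget.
+profit_per_unit = np.array([2.0, 1.5, 3.0])
+cost_per_unit = np.array([1.0, 0.8, 2.5])
+budget = 100.0
+
+def profit(x: np.ndarray) -> float:
+    # The quantity the bakery wants to maximise, as a function of the choice vector x
+    return float(profit_per_unit @ x)
+
+def is_feasible(x: np.ndarray) -> bool:
+    # A choice is only valid if every amount is non-negative and within budget
+    return bool(np.all(x >= 0) and cost_per_unit @ x <= budget)
+
+# One candidate choice: 20, 50 and 16 units of the three ingredients
+x = np.array([20.0, 50.0, 16.0])
+print(is_feasible(x), profit(x))  # True 163.0
+```
+
+An optimisation algorithm's job is then to search this space of choice vectors for the feasible one with the highest profit.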
+
+# Linear Algebra
+
+One can think of optimisation algorithms as an attempt to answer the question "what is the best choice one can make right now?"
+In order to find the *best* choice, we should figure out some way to organise the choices we *can* make.
+
+In virtually all cases, one can model the list of all choices as just a list of numbers, with each number corresponding to one parameter that can be adjusted. For example:
+
+- If the bakery has 10 ingredients, then the numbers could be an (ordered) list of the amount of each ingredient
+- The small business owner can represent a whole shift as a list of numbers, each number corresponding to one worker's shift
+- The deep learning engineer probably has a list of parameters already, and they just need to fine-tune the coefficients
+
+This is very similar to the coordinate system taught in high school.
+The same way we think of two numbers in a coordinate system as a point on the plane (like (0,0) being at the centre of the xy plane),
+we can think of these lists of numbers as points in some higher-dimensional space.
+See how even in the small example of the bakery with 10 ingredients, we're already dealing with 10-dimensional space.
+With a small AI, one might use a model with 128 parameters,
+meaning the mathematics represents some kind of 128-dimensional space.
+
+If you can't imagine 128 dimensions of space, that's fine.
+What's important is that the logic and systems that are true in 2 or 3 dimensions are also true in higher dimensions.
+The analogy of "points in space" will also be useful in a bit.
+In mathematical jargon, one could say an optimisation problem with *n* variables can be modelled as a vector space with *n+1* dimensions.
+Why *n+1*? See below.
+
+# Cost Function
+
+Once we have a way of describing all the choices we can make, we need a way of saying how good each choice is.
+In other words,
+we need to take each point in our space,
+and associate it with some metric of "goodness".
+The better that point is, the higher the number it receives.
+Mathematically,
+one would conceptualise this as a function which takes in the parameters as input,
+and outputs a single extra number.
+That's where the extra *+1* dimension comes from.
+
+This metric would arise naturally out of the problem we are trying to study.
+For the bakery example, it would be something like how much profit each choice can generate.
+In the AI example, it would probably be the performance of the model.
+
+# Why do we need these algorithms?
+
+Here's an exercise:
+
+Draw a square with side length 10 and fill it with a bunch of dots in random positions.
+Then, cut out 10 circles of radius 1.
+Now I ask you the question:
+how many dots can you cover at most?
+
+This is a very strange question,
+and it's not immediately obvious how one would approach this problem.
+If the points were laid out in front of us,
+a human could probably intuit the best placement of the circle cut-outs.
+But building an algorithm that can reliably cover a good number of dots is quite difficult.
+
+That's the point of these algorithms:
+to construct a way for computers to mimic the human intuition of a strange and complicated problem.
+We'll see some algorithms that can do that in a bit.
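+
+To make the idea of a cost function concrete first, here is a minimal, hypothetical sketch of the dot-covering exercise: a "choice" (a flat list of 20 numbers encoding 10 circle centres) goes in, and a single measure of goodness (dots covered) comes out. All names and numbers are invented for illustration:
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(seed=42)
+dots = rng.uniform(0, 10, size=(200, 2))  # random dots inside a 10 x 10 square
+
+def dots_covered(params: np.ndarray) -> int:
+    # The cost function: a flat vector of 20 numbers is interpreted as 10
+    # circle centres, and the score is how many dots lie inside at least
+    # one unit circle.
+    centres = params.reshape(10, 2)
+    distances = np.linalg.norm(dots[:, None, :] - centres[None, :, :], axis=2)
+    return int(np.sum(distances.min(axis=1) <= 1.0))
+
+# One particular choice of circle placements, picked at random
+choice = rng.uniform(0, 10, size=20)
+print(dots_covered(choice))
+```
+
+An optimisation algorithm would then search for the 20-dimensional point that makes this score as high as possible.
+But first, some more weird functions.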
+
+# Strange functions
+
+Here are some images of the outputs of a few functions:
+
+![Beale Contour](./imgs/Beale_contour.svg.png)
+![Rastrigin Contour](./imgs/Rastrigin_contour_plot.svg.png)
+![Rosenbrock Contour](./imgs/Rosenbrock_contour.svg.png)
+
+These are contour plots of the Beale, Rastrigin, and Rosenbrock functions respectively.
+Notice that it is not immediately obvious how a computer should determine the minima (or maxima) of these functions.
+Checking every single point becomes prohibitively expensive extremely quickly.
+
+# Algorithms
+
+Personally, I view these algorithms as the *definition* of machine learning.
+As in, training an artificial intelligence model is essentially a quest to find the right numbers to put into a model,
+such that said model performs optimally.
+
+## Gradient Descent
+
+*differentiable* functions
+
+## Genetic Algorithm
+
+## Constraint programming
+
+# Relation to HPC
diff --git a/src/chapter7/other-parallel-algos.md b/src/chapter7/other-parallel-algos.md
new file mode 100644
index 0000000..5c59782
--- /dev/null
+++ b/src/chapter7/other-parallel-algos.md
@@ -0,0 +1,3 @@
+# Other Parallel Algorithms
+
+![under-const](../imgs/under-const.gif)
\ No newline at end of file
diff --git a/src/chapter7/parallel-search.md b/src/chapter7/parallel-search.md
new file mode 100644
index 0000000..47811ba
--- /dev/null
+++ b/src/chapter7/parallel-search.md
@@ -0,0 +1,3 @@
+# Parallel Search
+
+![under-const](../imgs/under-const.gif)
\ No newline at end of file
diff --git a/src/chapter7/parallel-sort.md b/src/chapter7/parallel-sort.md
new file mode 100644
index 0000000..d740655
--- /dev/null
+++ b/src/chapter7/parallel-sort.md
@@ -0,0 +1,3 @@
+# Parallel Sort
+
+![under-const](../imgs/under-const.gif)
\ No newline at end of file
diff --git a/src/chapter8/challenges.md b/src/chapter8/challenges.md
new file mode 100644
index 0000000..169d112
--- /dev/null
+++ b/src/chapter8/challenges.md
@@ -0,0 +1,47 @@
+# Apache Spark Challenges
+
+## Overview
+
+- [Apache Spark Challenges](#apache-spark-challenges)
+  - [Overview](#overview)
+  - [Task 1 - Classic Distributed Problem: Token Counting](#task-1---classic-distributed-problem-token-counting)
+  - [Task 2 - Cluster Set-up Bash Scripts](#task-2---cluster-set-up-bash-scripts)
+  - [Task 3 - Spark and Slurm](#task-3---spark-and-slurm)
+  - [Task 4 - Data Processing](#task-4---data-processing)
+  - [Task 5 - Spark Machine Learning](#task-5---spark-machine-learning)
+
+> Note: Tasks 1, 2, and 3 closely resemble a **typical workflow** when working with Apache Spark:
+> - **Step 1**: Interactively work with a small sample of the problem
+> - **Step 2**: Solve and optimize the sample problem
+> - **Step 3**: Submit the entire larger problem as a batch job
+> - **Step 4**: Analyze the result and, if necessary, repeat steps 1 to 4
+>
+> You should employ this workflow in tasks 4 and 5.
+
+## Task 1 - Classic Distributed Problem: Token Counting
+
+Given a string of tokens, count the number of times each token appears. You should do this task in an interactive JupyterLab notebook connected to a Spark cluster. This is a canonical problem in distributed data processing, and often serves as an example of the [MapReduce Programming Model](https://en.wikipedia.org/wiki/MapReduce).
+
+> Hint: Have a look at [map()](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.map.html) and [reduceByKey()](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduceByKey.html)
+
+## Task 2 - Cluster Set-up Bash Scripts
+
+Write bash scripts to streamline the process of installing Spark and running the cluster.
+> Hint: Try to combine the steps from [subchapter: set up](./set-up.md)
+
+## Task 3 - Spark and Slurm
+
+Submit [task 1](#task-1---classic-distributed-problem-token-counting) as a Spark job using Slurm. This should be similar to [subchapter: job batching](./job-batching.md)
+> Hint:
+> - You will need to convert the notebook into a Python file.
+> - Compare the content of `$SPARK_HOME/examples/src/main/python/pi.py` and [our Monte Carlo Pi Estimation](./internals.md#monte-carlo-pi-estimation). They both solve the same problem; however, there are things that we don't need to add when directly using `spark-submit`. Why?
+
+## Task 4 - Data Processing
+
+In this task, we will start working with a dataframe and try to process a given real-world dataset.
+
+> The dataset, at around ~100MB, is considered small and not well-suited for Spark utilization (opting for Pandas might be more efficient). Nevertheless, working with this dataset serves as an exercise to understand more about Spark concepts and its capabilities.
+
+## Task 5 - Spark Machine Learning
+
+We will use the data from task 4 to build an intro-to-Machine-Learning model, [Linear Regression](https://en.wikipedia.org/wiki/Linear_regression), with [MLlib](https://spark.apache.org/mllib/).
diff --git a/src/chapter8/chapter8.md b/src/chapter8/chapter8.md
new file mode 100644
index 0000000..f83c84d
--- /dev/null
+++ b/src/chapter8/chapter8.md
@@ -0,0 +1,16 @@
+# Apache Spark
+
+Apache Spark is an open-source, distributed computing system that has gained immense popularity for its speed, ease of use, and versatility in handling large-scale data processing tasks. Developed to overcome the limitations of the MapReduce paradigm, Spark offers a unified platform for various data processing workloads, including batch processing, real-time data streaming, machine learning, and graph processing.
+
+Spark provides high-level APIs in languages like Scala, Java, Python, and R, making it accessible to a wide range of developers with different programming backgrounds.
+
+In this chapter, we will:
+- Set up a mini Spark cluster on M3.
+- Take a closer look at the internal data structure, specifically Resilient Distributed Datasets (RDDs).
+- Explore data processing in Spark and JupyterLab.
+- Submit batch jobs utilizing both Slurm and Spark.
+- Engage in some challenges.
+
+> Notes:
+> - The material covered in this chapter draws heavily from the [official documentation of Spark 3.5.0](https://spark.apache.org/docs/latest/index.html).
+> - The contents of [Setting up a Spark Cluster within M3 via Slurm](./set-up.md#setting-up-a-spark-cluster-within-m3-cluster) and [Submit Spark Job inside Slurm Job](./job-batching.md#job-batching) are both derived from a trial-and-error approach and do not adhere to any official documentation. Consequently, they may not reflect best practice. Thus, if you've discovered alternative methods, more effective approaches, or maybe even security vulnerabilities, please don't hesitate to submit a pull request.
diff --git a/src/chapter8/data-processing.md b/src/chapter8/data-processing.md
new file mode 100644
index 0000000..94cd30a
--- /dev/null
+++ b/src/chapter8/data-processing.md
@@ -0,0 +1,26 @@
+# Data Processing with Spark
+
+## Spark SQL
+
+Spark SQL, a module of Spark, is designed for structured data processing. Unlike the basic Spark RDD API, the interfaces in Spark SQL offer more details about the data's structure and the ongoing computation. Internally, Spark SQL utilizes this additional information to make optimizations for improved performance.
+
+### Architecture Overview
+
+![Spark SQL Architecture](./imgs/spark-sql.png)
+
+### Capabilities
+
+Spark SQL provides three main capabilities:
+- Load data from a variety of structured sources (e.g., JSON, Hive, and Parquet).
+- Query the data using SQL, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC), such as business intelligence tools like Tableau.
+- Provide rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more. Many jobs are easier to write using this combination.
+
+### Datasets and DataFrames
+
+A Dataset represents a distributed assembly of data, combining the advantages of Resilient Distributed Datasets (RDDs) with the efficiency of Spark SQL's optimized execution engine.
+
+Meanwhile, a DataFrame is a structured form of a Dataset, arranged into columns. It parallels the notion of a table in a relational database or a data frame in languages like R or Python, yet incorporates more sophisticated optimizations internally.
+
+### Working with PySpark DataFrames
+
+> We will now have a look at the examples provided in ```path/to/spark-3.5.0-bin-hadoop3/examples/src/main/python/sql/basic.py```
diff --git a/src/chapter8/imgs/jupyterlab.png b/src/chapter8/imgs/jupyterlab.png
new file mode 100644
index 0000000..04bfeae
Binary files /dev/null and b/src/chapter8/imgs/jupyterlab.png differ
diff --git a/src/chapter8/imgs/spark-architecture.png b/src/chapter8/imgs/spark-architecture.png
new file mode 100755
index 0000000..7713f78
Binary files /dev/null and b/src/chapter8/imgs/spark-architecture.png differ
diff --git a/src/chapter8/imgs/spark-cluster-overview.png b/src/chapter8/imgs/spark-cluster-overview.png
new file mode 100755
index 0000000..f9d8334
Binary files /dev/null and b/src/chapter8/imgs/spark-cluster-overview.png differ
diff --git a/src/chapter8/imgs/spark-sql.png b/src/chapter8/imgs/spark-sql.png
new file mode 100755
index 0000000..b608d48
Binary files /dev/null and b/src/chapter8/imgs/spark-sql.png differ
diff --git a/src/chapter8/internals.md b/src/chapter8/internals.md
new file mode 100644
index 0000000..d101f7c
--- /dev/null
+++ b/src/chapter8/internals.md
@@ -0,0 +1,175 @@
+# Spark Architecture Overview
+
+Here is a diagram representing the components of Apache Spark:
+
+![Architecture](./imgs/spark-architecture.png)
+
+> Within this chapter, we only look at the components in blue boxes.
+
+## Cluster Managers
+
+Spark is designed to smoothly scale from just one compute node to potentially thousands. It achieves this scalability while prioritizing flexibility by supporting various cluster managers, including Hadoop YARN, Apache Mesos, and its own Standalone Scheduler. For our setup, we've opted for the simplicity of the Standalone Scheduler to run a mini-cluster in the set-up section.
+
+## Spark Core - Resilient Distributed Datasets (RDDs)
+
+### What are RDDs?
+
+An RDD is simply an `immutable` distributed collection of objects. Within Spark, all tasks involve either generating new RDDs, transforming existing ones, or executing operations on RDDs to produce an outcome. Behind the scenes, Spark seamlessly disperses the data stored within RDDs into multiple partitions across your cluster and parallelizes the tasks you execute on them. Moreover, an RDD can accommodate Python, Java, or Scala objects of any type, including user-defined classes. RDDs enable users to explicitly `persist` intermediate outcomes in memory, `control their partitioning` to optimize data distribution, and manipulate them using a diverse `set of operators`.
+
+### Initialize RDD
+
+#### Parallelize Collections
+
+We can create an RDD from an existing iterable or collection:
+
+```python
+data: list = ["H", "e", "l", "l", "o", "!"]
+
+# by default, set the number of partitions automatically based on your cluster
+chars: RDD = spark_context.parallelize(data)
+
+# you can also set it manually by passing it as a second parameter to parallelize
+chars: RDD = spark_context.parallelize(data, 10)
+```
+
+#### Read from External Datasets
+
+PySpark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc.
+
+```python
+# takes a URI for the file (either a local path on the machine, or a hdfs://, s3a://, etc URI)
+# reads it as a collection of lines
+lines: RDD = spark_context.textFile("/path/to/hello.txt")
+```
+
+Once created, an RDD can be operated on by a diverse list of functional-style operations.
+
+### RDD Operations
+
+#### Transformations and Actions
+
+RDDs support two types of operations, `transformations` and `actions`:
+- Transformations are operations on RDDs that return a new RDD by applying the same operation to many data items. Transformations are `lazily evaluated` (including loading data); they are only computed when an action is called. Spark internally records metadata to indicate that a transformation has been requested.
+- Actions are operations that run a computation on an RDD and then return a value (non-RDD) or export data to a storage system.
+
+#### Why laziness?
+
+Many popular cluster computing frameworks (such as MapReduce and Dryad) offer high-level operators for parallel computations, simplifying work distribution and fault tolerance concerns. However, they lack abstractions for efficiently optimizing distributed memory when dealing with applications that frequently reuse intermediate results ([iterative methods](https://en.wikipedia.org/wiki/Iterative_method), [interactive data mining](https://en.wikipedia.org/wiki/Data_mining), etc.).
+
+Within the majority of these frameworks, the exclusive method for reusing data between computations (e.g., between two MapReduce jobs) involves storing it in an external stable storage system, such as a distributed file system, resulting in significant overhead due to data replication, disk I/O operations, and serialization. Therefore, we would often have to spend time considering how to group operations together to minimize the number of MapReduce passes through our data.
+
+> Spark employs lazy evaluation to:
+> - Minimize the number of passes it makes over the data by consolidating operations.
+> - Allow richer and more flexible combinations of transformations.
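+
+As a small illustration of this laziness (assuming a `SparkSession` named `spark` is already available, as in the examples further below), the transformations here return immediately without touching the data; only the final action triggers a single, fused pass over the partitions:
+
+```python
+nums = spark.sparkContext.parallelize(range(1_000_000))
+
+# Transformations: nothing is computed yet; Spark only records the lineage
+squares = nums.map(lambda x: x * x)
+evens = squares.filter(lambda x: x % 2 == 0)
+
+# Action: only now does Spark schedule work, running the map and the filter
+# together in one pass over the data
+print(evens.count())  # 500000
+```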
+
+#### APIs Reference
+
+A complete list of PySpark APIs is available in the [Apache Spark Documentation](https://spark.apache.org/docs/latest/api/python/reference/index.html). If you're ever uncertain about whether a specific function is a transformation or an action, simply check its return type: `transformations` return `RDDs`, whereas `actions` return some `other data type`.
+
+#### Examples
+
+Let's look at a few simple examples of using PySpark.
+
+> You should follow these examples in a JupyterLab notebook connected to a Spark cluster
+
+##### Monte Carlo Pi Estimation
+
+```python
+import pyspark
+from pyspark.sql import SparkSession
+from random import random
+from operator import add
+
+# Set this to the name of the current master node of the Spark Cluster
+# Eg: master_node = "m3i014"
+master_node = ...
+
+###### Initialize Spark Session ######
+spark = SparkSession \
+    .builder \
+    .master(f"spark://{master_node}.massive.org.au:7077") \
+    .appName("PythonJupyter") \
+    .config("spark.workers.show", "True") \
+    .getOrCreate()
+
+###### Monte Carlo Pi Estimation ######
+
+partitions = 4
+n = 1000 * partitions
+data = range(1, n + 1)
+
+def point(_: int) -> int:
+    x = random() * 2 - 1
+    y = random() * 2 - 1
+    return 1 if x ** 2 + y ** 2 <= 1 else 0
+
+count = (
+    spark.sparkContext
+    # distribute data into partitions across the cluster; this gives us the starting RDD
+    .parallelize(data, partitions)
+    # Transformation: map each data item through a function, resulting in a new RDD
+    .map(point)
+    # Action: reduce all the data into a single value
+    .reduce(add)
+)
+
+print("Pi is roughly %f" % (4.0 * count / n))
+```
+
+##### Sort a Text File
+
+```bash
+# Randomly generate numbers and store them in a file called "rand_ints.csv"
+# Note: this is a bash command
+for i in {1..1000}; do echo $[($RANDOM)]; done > ./rand_ints.csv
+```
+
+```python
+from typing import Tuple
+from pyspark.rdd import RDD
+from pyspark.sql import SparkSession
+
+# Set this to the path of the file rand_ints.csv
+# Note: don't use a relative path
+file_path = ...
+
+# `spark` is the SparkSession created in the previous example
+lines = spark \
+    .read \
+    .text(file_path) \
+    .rdd \
+    .map(lambda r: r[0])
+
+sortedCount: RDD[Tuple[int, int]] = lines \
+    .flatMap(lambda x: x.split(' ')) \
+    .map(lambda x: (int(x), 1)) \
+    .sortByKey()
+
+# sortByKey is a transformation, so an action is needed to trigger the computation
+print(sortedCount.take(10))
+```
+
+### Shared Variables
+
+Typically, when a function is passed to a Spark operation like map or reduce and executed on a remote cluster node, it operates on separate copies of all the variables used within the function. These variables are duplicated across each machine, and any updates made to these variables on the remote machine are `not` communicated back to the driver program. Fortunately, Spark does provide two types of shared variables: `Broadcast variables` and `Accumulators`.
+
+#### Broadcast variables
+
+Broadcast variables allow the program to efficiently send a large, `read-only` value to all the worker nodes exactly once, for use in one or more Spark operations. The broadcast variable will stay on each node until one of these occurs:
+- All tasks are finished.
+- An [`unpersist()`](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.unpersist.html) method is called. However, this method will only release the resources on the executor node (the variable will be re-broadcast when it is used again).
+- A [`destroy()`](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.Broadcast.destroy.html) is called. This time, the variable can't be used again.
+
+> Note that these methods do not block by default!
To block until resources are freed, specify `blocking=True` when calling them.
+
+```python
+broadcastVar = spark_context.broadcast([1, 2, 3])
+print(broadcastVar.value) # [1, 2, 3]
+```
+
+#### Accumulators
+
+Another type of shared variable, accumulators, provides a simple syntax for aggregating values from worker nodes back to the driver program.
+
+> Note that it is possible to aggregate values from an entire RDD back to the driver program using actions (e.g. `reduce()`). However, when we want a simpler way to calculate a metric at some point in the program (e.g. counting the number of occurrences of the word "Hello" in distributed text data), an accumulator is a handy solution.
+
+```python
+accum = spark_context.accumulator(0)
+spark_context.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
+print(accum.value) # 10
+```
diff --git a/src/chapter8/job-batching.md b/src/chapter8/job-batching.md
new file mode 100644
index 0000000..03d8802
--- /dev/null
+++ b/src/chapter8/job-batching.md
@@ -0,0 +1,59 @@
+# Job Batching
+
+## spark-submit
+
+The spark-submit script found in Spark's bin directory serves to initiate applications on a cluster. It seamlessly integrates with all of Spark's supported cluster managers via a unified interface, eliminating the need for specific configuration adjustments tailored to each one individually.
+
+```bash
+# Remember to change master_node to the actual master node's name
+/path/to/spark-3.5.0-bin-hadoop3/bin/spark-submit \
+    --master spark://{master_node}:7077 \
+    hello_world.py
+```
+
+> Here is a comprehensive [documentation](https://spark.apache.org/docs/3.5.0/submitting-applications.html#launching-applications-with-spark-submit) of all `spark-submit` options. We will not look into all the details here (if you need to find a particular option, it is better to look directly into the documentation itself).
+>
+> We will focus on M3 and see how we can bundle and submit everything as Slurm jobs.
+
+## Slurm and Spark
+
+Here is a complete bash script that we can `sbatch` to M3.
We will now go through what is happening here:
+
+```bash
+#!/bin/bash
+#SBATCH --job-name=SparkSlurm
+#SBATCH --nodes=4
+#SBATCH --time=00:15:00
+#SBATCH --ntasks=1
+#SBATCH --mem-per-cpu=4096
+#SBATCH --cpus-per-task=1
+#SBATCH --partition=m3i
+
+############################ Spark Cluster #######################################
+
+# These are basically the same steps as seen in the set-up subchapter
+
+# Remember to change this to the actual path
+export SPARK_HOME="/path/to/spark-3.5.0-bin-hadoop3"
+
+# Add to the global PATH
+export PATH="${SPARK_HOME}/bin:${PATH}"
+
+# Get all worker nodes and write them to Spark's config file
+scontrol show hostnames | sed -n '2,$p' > $SPARK_HOME/conf/workers
+
+# Start the spark cluster
+$SPARK_HOME/sbin/start-all.sh
+
+################################### Submit Job ###################################
+
+# Get the master node, which will be the first node from the command `scontrol show hostnames`
+master_node=$(scontrol show hostnames | head -1)
+
+# Submit a job to the Spark Cluster using spark-submit
+# In this script we submit a provided example from the spark directory
+$SPARK_HOME/bin/spark-submit \
+    --master spark://${master_node}:7077 \
+    $SPARK_HOME/examples/src/main/python/pi.py \
+    1000
+```
diff --git a/src/chapter8/set-up.md b/src/chapter8/set-up.md
new file mode 100644
index 0000000..f9bfd5e
--- /dev/null
+++ b/src/chapter8/set-up.md
@@ -0,0 +1,118 @@
+# Setting up a Spark cluster within M3 cluster
+
+## Installation
+
+Different Spark versions may have different version requirements for Python and Java. We will use Spark 3.5, Python 3.10 (JupyterLab environment), and Java 8 (default on M3). There are two main things that we need to set up:
+
+- **Miniconda**
+
+Conda can be installed using the conda-install module on M3. The JupyterLab environment currently uses Python 3.10, meeting the prerequisite for Spark 3.5. This environment will also be utilized for running a notebook in STRUDEL. For those interested in customizing their own conda environment on M3, detailed instructions can be found in [this guide](https://docs.massive.org.au/M3/connecting/strudel2/connecting-to-byo-jupyter.html).
+
+```bash
+module load conda-install
+conda-install
+```
+
+- **Spark Engine**
+
+Since M3 does not include a built-in module for Spark, we need to download and unzip it. Other versions can be found [here](https://spark.apache.org/downloads.html)
+
+```bash
+# Spark 3.5.0
+wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
+tar xzfv spark-3.5.0-bin-hadoop3.tgz
+rm -rf spark-3.5.0-bin-hadoop3.tgz
+```
+
+## Getting the cluster running
+
+### Requesting nodes from M3
+
+We can refer to the [M3 chapter](/chapter5/slurm_intro.md) to request an interactive session.
E.g.:
+```bash
+# Interactive session with: 16 cpus (4 tasks x 4 cpus per task), 1 hour of time, 4 nodes, 8gb memory per node on the m3i partition
+smux new-session --jobname=testing_spark --ntasks=4 --time=0-01:00:00 --partition=m3i --mem=8GB --nodes=4 --cpus-per-task=4
+```
+
+### Setting up workers
+
+- Write the requested nodes' names to the config file within Spark
+```bash
+# Get the names of all nodes
+# Read from the 2nd node to the last node (as the 1st node is the master node)
+# Write them to the `workers` config file within the Spark engine
+scontrol show hostnames | sed -n '2,$p' > /path/to/spark-3.5.0-bin-hadoop3/conf/workers
+```
+
+### Starting the spark cluster
+
+- Start the cluster
+```bash
+/path/to/spark-3.5.0-bin-hadoop3/sbin/start-all.sh
+```
+
+## Connecting JupyterLab (STRUDEL) to the Cluster
+
+[Spark's shell](https://spark.apache.org/docs/latest/quick-start.html#interactive-analysis-with-the-spark-shell) offers a straightforward approach to grasping the API and serves as a robust tool for interactive data analysis. However, incorporating a user-friendly interface facilitates more intricate operations and functionalities. In this context, we'll utilise JupyterLab, which is made available through STRUDEL.
+
+### Requesting a notebook through STRUDEL
+
+When installing miniconda on M3, there will be a built-in environment named ```jupyterlab```. This environment will also be automatically added into STRUDEL, and we can launch a JupyterLab session from this environment.
+
+![JupyterLab](imgs/jupyterlab.png)
+
+### Exporting environment variables
+
+```python
+import sys
+import os
+
+# Specify paths, which will be set as environment variables and read by the Spark engine
+SPARK_HOME = "/path/to/spark-3.5.0-bin-hadoop3"
+os.environ["SPARK_HOME"] = SPARK_HOME
+PREV_PATH = os.environ["PATH"]
+os.environ["PATH"] = f"{SPARK_HOME}/bin:{PREV_PATH}"
+
+# Specify which Python should be used by the driver and the workers
+os.environ["PYSPARK_PYTHON"] = sys.executable
+
+# Add pyspark to the PYTHONPATH
+PYSPARK_HOME = f"{SPARK_HOME}/python"
+sys.path.append(PYSPARK_HOME)
+
+# Different Spark versions ship different py4j versions
+sys.path.append(f"{PYSPARK_HOME}/lib/py4j-0.10.9.7-src.zip")
+```
+
+### Connecting to the Spark master
+
+```python
+import pyspark
+from pyspark.sql import SparkSession
+
+# Set this to the name of the current master node, e.g. master_node = "m3i014"
+master_node = ...
+
+# Initialize a Spark session connected to the Spark cluster
+spark = SparkSession \
+    .builder \
+    .master(f"spark://{master_node}.massive.org.au:7077") \
+    .appName("test") \
+    .config("spark.workers.show", "True") \
+    .getOrCreate()
+```
+
+### SSH Tunnel SparkUI to Localhost
+
+We can use [ssh tunneling](https://www.ssh.com/academy/ssh/tunneling) to forward the Spark UI of the engine to our local machine; the UI will be available as a web application at ```http://localhost:4040```.
+
+```bash
+# The jupyterlab node is the node that the JupyterLab session is currently running on
+ssh -L 4040:{jupyterlab_node}.massive.org.au:4040 {username}@m3.massive.org.au
+```
+
+## Overview of what we have just done
+
+The overall architecture can be demonstrated through this diagram:
+- Each circle is a computing node of M3.
+- A mini-cluster managed by Spark is running inside the big M3 cluster.
+- We also have a notebook (from STRUDEL) connecting to the master node of the Spark cluster.
+
+![Overview](imgs/spark-cluster-overview.png)
\ No newline at end of file
diff --git a/src/imgs/under-const.gif b/src/imgs/under-const.gif
new file mode 100644
index 0000000..d155fb6
Binary files /dev/null and b/src/imgs/under-const.gif differ