diff --git a/src/intro-to-parallel-comp/challenges.md b/src/intro-to-parallel-comp/challenges.md
index 0d9c4ce..9dd157f 100644
--- a/src/intro-to-parallel-comp/challenges.md
+++ b/src/intro-to-parallel-comp/challenges.md
@@ -4,7 +4,7 @@
 
 Make sure to clone a copy of **your** challenges repo onto M3, ideally in a personal folder on vf38_scratch.
 
-> Note: For every challenge you will be running the programs as SLURM jobs. This is so we don't overload the login nodes. A template [SLURM job script](./job.slurm) is provided at the root of this directory which you can use to submit your own jobs to SLURM by copying it to each challenges sub-directory and filling in the missing details. You may need more than one for some challenges. This template will put the would-be-printed output in a file named `slurm-.out`.
+> Note: For every challenge you will be running the programs as SLURM jobs. This is so we don't overload the login nodes. A template [SLURM job script](https://github.com/MonashDeepNeuron/HPC-Training-Challenges/blob/main/challenges/distributed-computing/job.slurm) is provided at the root of this directory. You can use it to submit your own jobs to SLURM by copying it into each challenge's sub-directory and filling in the missing details. You may need more than one for some challenges. This template will put the would-be-printed output in a file named `slurm-.out`.
 
 ## Task 1 - Single Cluster Job using OpenMP
 
diff --git a/src/intro-to-parallel-comp/intro-to-parallel-comp.md b/src/intro-to-parallel-comp/intro-to-parallel-comp.md
index 99a61e3..90929a5 100644
--- a/src/intro-to-parallel-comp/intro-to-parallel-comp.md
+++ b/src/intro-to-parallel-comp/intro-to-parallel-comp.md
@@ -4,6 +4,6 @@ As introduced in chapter 5, parallel computing is all about running instructions
 
 ![query-parallelism](./imgs/query-parallelism.png)
 
-In this context, you can consider a query to be a job that carries out a series of steps on a particular dataset in order to achieve something e.g. a SORT query on a table. It's fairly straightforward to execute multiple queries at the same time using a parallel/distributed system but what if we want to parallelise and speed up the individual operations within a query?
+In this context, you can consider a query to be a job that carries out a series of steps on a particular input in order to achieve something, e.g. a SORT query on a table. It's fairly straightforward to execute multiple queries at the same time using a parallel/distributed system, but what if we want to parallelise and speed up the individual operations within a query?
 
 This is where things like synchronisation, data/workload distribution and aggregation needs to be considered. In this chapter we will provide some theoretical context before learning how to implement parallelism using OpenMP & MPI.
\ No newline at end of file
diff --git a/src/intro-to-parallel-comp/locks.md b/src/intro-to-parallel-comp/locks.md
index 78052d6..c845cd4 100644
--- a/src/intro-to-parallel-comp/locks.md
+++ b/src/intro-to-parallel-comp/locks.md
@@ -1,13 +1,13 @@
 # Locks
 Ealier, we have learnt about how to write concurrent programs, as well as a few constructs to achieve **synchronisation** in OpenMP. We know that:
-- `reduction construct` partitions shared data and uses barrier to achieve synchronisation
+- `reduction construct` partitions shared data and uses a barrier to achieve synchronisation
 - `atomic construct` utilises hardware ability to achieve thread-safe small memory read/write operations.
 
 What about `critical construct`?
 We said that it uses locks, but what are locks?
 
 > Notes that the direct use of locks is **not recommended** (at least in OpenMP):
-> - It is very easy to cause deadlock or hard-to-debug livelock (more on these at the end of this sub-chapter).
+> - It is very easy to cause a deadlock or a hard-to-debug livelock (more on these at the end of this sub-chapter).
 > - It can often cause very poor performance or worse.
 > - It generally indicates that the program design is wrong.
 >
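+
+Purely as an illustration of what the `critical construct` uses under the hood, here is a minimal sketch of OpenMP's explicit lock routines (the `counter` variable and the surrounding program are made up for this sketch; in real code, prefer the constructs above):
+
+```c
+#include <omp.h>
+#include <stdio.h>
+
+int main(void) {
+    omp_lock_t lock;
+    omp_init_lock(&lock);       // create the lock
+
+    int counter = 0;
+
+    #pragma omp parallel
+    {
+        omp_set_lock(&lock);    // acquire: only one thread at a time gets past this point
+        counter++;              // protected section
+        omp_unset_lock(&lock);  // release so another thread can enter
+    }
+
+    omp_destroy_lock(&lock);    // clean up the lock
+    printf("counter = %d\n", counter);
+    return 0;
+}
+```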
diff --git a/src/intro-to-parallel-comp/message-passing.md b/src/intro-to-parallel-comp/message-passing.md
index d893137..36b30ed 100644
--- a/src/intro-to-parallel-comp/message-passing.md
+++ b/src/intro-to-parallel-comp/message-passing.md
@@ -1,8 +1,8 @@
 # Message Passing
 
-As each processor has its own local memory with its own address space in distributed computing, we need a way to communicate between the processes and share data. Message passing is the mechanism of exchanging data across processes. Each process can communicate with one or more other processes by sending messages over a network.
+As each processor has its own local memory with its own address space in distributed computing, we need a way for the distributed processes to communicate and share data. Message passing is the mechanism of exchanging data between processes. Each process can communicate with one or more other processes by sending messages over a network.
 
-The MPI (message passing interface) in OpenMPI is a communication protocol standard defining message passing between processors in distributed environments and are implemented by different groups with the main goals being high performance, scalability, and portability.
+MPI (Message Passing Interface) is a communication protocol standard that defines message passing between processors in distributed environments. The main goals of this standard are high performance, scalability, and portability.
 
 OpenMPI is one implementation of the MPI standard. It consists of a set of headers library functions that you call from your program. i.e. C, C++, Fortran etc.
 
@@ -125,9 +125,13 @@
 
 int MPI_Finalize(void);
 
 ```
 
-Use man pages to find out more about each routine
+Terminology:
+- **World Size**: The total number of processes involved in your distributed computing job.
+- **Rank**: A unique ID for a particular process.
 
-When sending a Process it packs up all of its necessary data into a buffer for the receiving process. These buffers are often referred to as envelopes since the data is being packed into a single message before transmission (similar to how letters are packed into envelopes before transmission to the post office)
+> Use the OpenMPI man pages to find out more about each routine.
+
+When a process sends data, it packs everything it needs to transmit into a buffer for the receiving process. These buffers are often referred to as envelopes, since the data is packed into a single message before transmission (similar to how letters are packed into envelopes before being handed to the post office).
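+
+As a rough sketch of what this looks like in code, here is one process sending a single integer to another (this assumes the job is launched with at least 2 processes; error handling is omitted):
+
+```c
+#include <mpi.h>
+#include <stdio.h>
+
+int main(int argc, char **argv) {
+    MPI_Init(&argc, &argv);
+
+    int rank;
+    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // this process's unique ID
+
+    if (rank == 0) {
+        int payload = 42;
+        // Pack and send one MPI_INT to rank 1, with tag 0
+        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
+    } else if (rank == 1) {
+        int received;
+        // Receive one MPI_INT from rank 0, with tag 0
+        MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+        printf("Rank 1 received %d\n", received);
+    }
+
+    MPI_Finalize();
+    return 0;
+}
+```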
 
 ### Elementary MPI Data types
 
@@ -257,8 +261,9 @@ The command top or htop looks into a process. As you can see from the image belo
 - The command ```time``` checks the overall performance of the code
 - By running this command, you get real time, user time and system time.
 - Real is wall clock time - time from start to finish of the call. This includes the time of overhead
-- User is the amount of CPU time spent outside the kernel within the process
-- Sys is the amount of CPU time spent in the kernel within the process.
+  - User is the amount of CPU time spent outside the kernel within the process.
+  - Sys is the amount of CPU time spent in the kernel within the process.
 - User time +Sys time will tell you how much actual CPU time your process used.
+
 
 ![time](imgs/time.png)
diff --git a/src/intro-to-parallel-comp/multithreading.md b/src/intro-to-parallel-comp/multithreading.md
index 34a8dd9..aba9f99 100644
--- a/src/intro-to-parallel-comp/multithreading.md
+++ b/src/intro-to-parallel-comp/multithreading.md
@@ -1,32 +1,32 @@
 # Multithreading
 
-We have all looked at the theory of threads and concurrent programming in the Operating System chapter. Now, we will shift our focus to OpenMP and its application for executing multithreaded operations in a declarative programming style.
+Hopefully by now you are all familiar with multi-threading and how parallel computing works. We'll now go through how to implement parallel computing using OpenMP in order to speed up the execution of our C programs.
 
 ## OpenMP
 
-OpenMP is an Application Program Interface (API) that is used to explicitly direct multi-threaded, shared memory parallelism in C/C++ programs. It is not intrusive on the original serial code in that the OpenMP instructions are made in pragmas interpreted by the compiler.
-
-> Further features of OpenMP will be introduced in conjunction with the concepts discussed in later sub-chapters.
+OpenMP is an Application Program Interface (API) that is used to implement multi-threaded, shared memory parallelism in C/C++ programs. It's designed to be a very minimal add-on to serial C code: all you have to do is use the `#pragma` mechanism (C preprocessor directives) to wrap the parallel regions of your code.
 
 ### Fork-Join Parallel Execution Model
 
-OpenMP uses the `fork-join model` of parallel execution.
+OpenMP uses the *fork-join model* of parallel execution.
 
-* **FORK**: All OpenMP programs begin with a `single master thread` which executes sequentially until a `parallel region` is encountered, when it creates a team of parallel threads.
+* **FORK**: All OpenMP programs begin with a *single master thread* which executes sequentially until a *parallel region* is encountered. At that point, it spawns a *team of threads* to carry out the parallel computation.
 
-The OpenMP runtime library maintains a pool of threads that can be added to the threads team in parallel regions. When a thread encounters a parallel construct and needs to create a team of more than one thread, the thread will check the pool and grab idle threads from the pool, making them part of the team.
+The OpenMP runtime library maintains a pool of potential OS threads that can be added to the thread team during parallel region execution. When a thread encounters a parallel construct (pragma directive) and needs to create a team of more than one thread, it will check the pool and grab idle threads from the pool, making them part of the team.
 
-* **JOIN**: Once the team threads complete the parallel region, they `synchronise` and return to the pool, leaving only the master thread that executes sequentially.
+This *warm start* mechanism speeds up thread spawning by minimising the kernel scheduling and context-switching overhead involved in creating new threads.
 
-![Fork - Join Model](./imgs/fork-join.png)
+> If you're unclear on how kernel scheduling and context switching work, revisit the operating systems chapter and look up the topics introduced there.
+
+* **JOIN**: Once the team of threads completes the parallel region, they **synchronise** and return to the pool, leaving only the master thread, which executes sequentially.
 
-> We will look a bit more into what is synchronisation as well as synchronisation techniques in the next sub-chapter.
+![Fork - Join Model](./imgs/fork-join.png)
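+
+As a quick illustration of the fork-join model, here is a minimal sketch (compiled with `gcc -fopenmp`; the output order will vary between runs) in which a team of threads is forked, each thread reports its ID, and execution then joins back to the master thread:
+
+```c
+#include <omp.h>
+#include <stdio.h>
+
+int main(void) {
+    printf("Master thread before the parallel region\n");
+
+    // FORK: a team of threads executes this block
+    #pragma omp parallel
+    {
+        int id = omp_get_thread_num();      // this thread's ID within the team
+        int team = omp_get_num_threads();   // size of the team
+        printf("Hello from thread %d of %d\n", id, team);
+    }
+    // JOIN: implicit synchronisation here; only the master thread continues
+
+    printf("Master thread after the parallel region\n");
+    return 0;
+}
+```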
 
 ### Imperative vs Declarative
 
-Imperative programming specifies and directs the control flow of the program. On the other hand, declarative programming specifies the expected result and core logic without directing the program's control flow.
+Imperative programming specifies and directs the control flow of the program. On the other hand, declarative programming specifies the expected result and core logic without directing the program's control flow, i.e. you tell the computer *what* to do instead of *how* to do it.
 
-OpenMP follows a declarative programming style. Instead of manually creating, managing, synchronizing, and terminating threads, we can achieve the desired outcome by simply declaring it using pragma.
+OpenMP follows a declarative programming style. Instead of manually creating, managing, synchronizing, and terminating threads, we can achieve the desired outcome by simply declaring pragma directives in our code.
 
 ![Structure Overview](./imgs/program-structure.png)
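+
+To make the contrast concrete, here is a small sketch of the declarative style (the loop and variable names are made up for this example): we only declare that the loop is parallel and that `sum` is a reduction variable, and OpenMP handles thread creation, work distribution and joining for us.
+
+```c
+#include <omp.h>
+#include <stdio.h>
+
+int main(void) {
+    long long sum = 0;
+
+    // Declare the intent; OpenMP decides how to split the iterations across threads
+    #pragma omp parallel for reduction(+:sum)
+    for (int i = 1; i <= 1000000; i++) {
+        sum += i;
+    }
+
+    printf("sum = %lld\n", sum);
+    return 0;
+}
+```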
@@ -52,24 +52,22 @@ int main() {
 
 ## Running on M3
 
-Here is a template script provided in the home directory in M3. Notice that we can dynamically change the number of threads using `export OMP_NUM_THREADS=12`
+Here is a template script provided in the home directory on M3. Notice that we can dynamically change the number of threads using `export OMP_NUM_THREADS=12`.
+
+> The `export` statement is a bash command you can type into a WSL/Linux terminal. It allows you to set environment variables in order to manage runtime configuration.
 
 ```bash
 #!/bin/bash
 # Usage: sbatch slurm-openmp-job-script
 # Prepared By: Kai Xi, Apr 2015
-# help@massive.org.au
-
 # NOTE: To activate a SLURM option, remove the whitespace between the '#' and 'SBATCH'
 
 # To give your job a name, replace "MyJob" with an appropriate name
 # SBATCH --job-name=MyJob
-
 # To set a project account for credit charging,
 # SBATCH --account=pmosp
-
 # Request CPU resource for a openmp job, suppose it is a 12-thread job
 # SBATCH --ntasks=1
 # SBATCH --ntasks-per-node=1
@@ -81,24 +79,20 @@ Here is a template script provided in the home directory in M3. Notice that we c
 # Set your minimum acceptable walltime, format: day-hours:minutes:seconds
 # SBATCH --time=0-06:00:00
-
 # To receive an email when job completes or fails
 # SBATCH --mail-user=
 # SBATCH --mail-type=END
 # SBATCH --mail-type=FAIL
-
 # Set the file for output (stdout)
 # SBATCH --output=MyJob-%j.out
 
 # Set the file for error log (stderr)
 # SBATCH --error=MyJob-%j.err
-
 # Use reserved node to run job when a node reservation is made for you already
 # SBATCH --reservation=reservation_name
-
 # Command to run a openmp job
 # Set OMP_NUM_THREADS to the same value as: --cpus-per-task=12
 export OMP_NUM_THREADS=12
diff --git a/src/intro-to-parallel-comp/parallel-algos.md b/src/intro-to-parallel-comp/parallel-algos.md
index 501ef3d..6d2eb24 100644
--- a/src/intro-to-parallel-comp/parallel-algos.md
+++ b/src/intro-to-parallel-comp/parallel-algos.md
@@ -37,4 +37,4 @@ This is how one of the simplest parallel algorithms - **parallel sum** works. Al
 
 ![parallel-sum](./imgs/parallel-sum-diagram.png)
 
-Besides the difference between serial & parallel regions another important concept to note here is **partitioning** aka. chunking. Often when you're parallelising your serial algorithm you will have to define local, parallel tasks that will execute on different parts of your dataset simultaneously in order to acheive a speedup. This can be anything from a sum operation in this case, to a local/serial sort or even as complex as the training of a CNN model on a particular batch of images.
\ No newline at end of file
+Besides the difference between serial & parallel regions, another important concept to note here is **partitioning**, a.k.a. chunking. Often when you're parallelising your serial algorithm you will have to define local, parallel tasks that will execute on different parts of your input simultaneously in order to achieve a speedup. This can be anything from a sum operation in this case, to a local/serial sort, or even something as complex as training a CNN model on a particular batch of images.
\ No newline at end of file
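+
+To illustrate partitioning, here is a minimal sketch (illustrative only; `data`, `N` and the chunking scheme are made up for this example) in which each thread works out its own chunk of the input, computes a local sum for that chunk, and the partial results are then aggregated:
+
+```c
+#include <omp.h>
+#include <stdio.h>
+
+#define N 1000000
+
+int main(void) {
+    static int data[N];
+    for (int i = 0; i < N; i++) data[i] = 1;
+
+    long long total = 0;
+
+    #pragma omp parallel
+    {
+        int id = omp_get_thread_num();
+        int nthreads = omp_get_num_threads();
+
+        // Partition: work out this thread's chunk [start, end)
+        int chunk = (N + nthreads - 1) / nthreads;
+        int start = id * chunk;
+        int end = (start + chunk < N) ? start + chunk : N;
+
+        // Local/serial task: sum this chunk only
+        long long local = 0;
+        for (int i = start; i < end; i++) local += data[i];
+
+        // Aggregation: combine the partial sums safely
+        #pragma omp atomic
+        total += local;
+    }
+
+    printf("total = %lld\n", total);
+    return 0;
+}
+```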
diff --git a/src/intro-to-parallel-comp/synchronisation.md b/src/intro-to-parallel-comp/synchronisation.md
index 132f56f..d89d372 100644
--- a/src/intro-to-parallel-comp/synchronisation.md
+++ b/src/intro-to-parallel-comp/synchronisation.md
@@ -1,8 +1,11 @@
 # Synchronisation
 
-Definition: Synchronisation is the task of coordinating multiple of processes (or threads) to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of action.
+Synchronisation is the task of coordinating multiple processes (or threads) to join up or handshake at a certain point, in order to reach an agreement or commit to a certain sequence of actions. This is important so that the different threads/processes you spawn in a parallel region don't conflict with each other and corrupt your results.
 
 ## Race Condition
+A race condition is one of the most common forms of this kind of corruption. It essentially occurs when your threads are in a "race" against each other to access a particular resource (e.g. an int variable's value) and the loser's access/update to that resource is lost.
+
+> Instead of your threads fighting each other, you want them to work together in perfect sync, i.e. more like an F1 pit stop crew than a toxic workplace.
 
 Let's start with this simple program:
 
@@ -64,9 +67,9 @@ The output should look something like this:
 
 ![1 thread counter](./imgs/one-thread-counter.png)
 
-The output coordinates with what we expected.
+The program works great. No corruption at all.
 - This is because we only used 1 single thread.
-- The program is just a sequential program without any parallism.
+- The program is just a serial program without any parallelism.
 - `sleep()` calls simply put the thread to sleep, that same thread will go to sleep, wake up, and continue the execution.
 
 ### Multiple Threads
 
@@ -76,7 +79,9 @@ export OMP_NUM_THREADS=2
 ./counter
 ```
 
-Running the program using 2 threads may give us this output (this is just 1 **possible** output):
+Running the program using 2 threads may give us this output:
+
+> Note: This is just one possible output.
 
 ![alt text](./imgs/two-threads-counter.png)
 
@@ -86,13 +91,13 @@ What is happening here?
 - During the time when 1 thread is sleeping, the other thread may increment the shared counter.
 - The 2 threads simply go on their way and not coordinate with each other.
 
-> What we are having here is `Race Condition`. A race condition occurs when two or more threads can access `shared data` and they try to `change it at the same time`.
+> This is a **race condition**. A race condition occurs when two or more threads can access **shared data** and they try to **change it at the same time**.
 
-### How to resolve the problem?
+### How to prevent race conditions in OpenMP?
 
-There are a few ways we can resolve the race condition in OpenMP:
+There are a few approaches we can take:
 
-* **Critical construct**: This restricts the code so that only one thread can do something at a time (in our example, only 1 thread can increment the counter at a time). However, it is `bad for performance` and possibly destroy a lot of the gains from running code in parallel in the first place.
+* **Critical construct**: This restricts the code so that only one thread can execute it at a time (in our example, only 1 thread can increment the counter at a time). It's used to specify a **critical region**: a section of code that must be executed by one thread at a time.
 
 ```c
 int main() {
@@ -106,7 +111,9 @@ int main() {
 }
 ```
 
-* **Atomic construct**: This is quite similar to Critical construct, however, it only applies to memory read/write operations. It has a better performance than the Critical construct by taking advantage on the hardware. There's no lock/unlock needed on entering/exiting the line of code, it just does the atomic operation which the hardware tells you can't be interfered with. Let's look at another example:
+This is unfortunately not appropriate for some situations, since it is bad for performance and destroys a lot of the speed-up we're trying to achieve in the first place.
+
+* **Atomic construct**: This is quite similar to the critical construct, however it only applies to memory read/write operations. It has better performance than the critical construct because it takes advantage of the hardware: there's no lock/unlock needed on entering/exiting the line of code, it just performs the atomic operation which the hardware guarantees can't be interfered with. Let's look at another example:
 
 > Run this program multiple times using multiple threads (before uncommenting the construct). Again, race condition!
 
@@ -187,7 +194,7 @@ int main() {
 
 ## Barrier Synchronisation
 
-In the last sub-chapter, we have talked about the [Fork - Join Model](./multithreading.md#fork-join-parallel-execution-model). We know that **"Once the team threads complete the parallel region, they `synchronise` and return to the pool, leaving only the master thread that executes sequentially."**. However, there are a few important aspects that we have left out:
+In the last sub-chapter, we talked about the [Fork - Join Model](./multithreading.md#fork-join-parallel-execution-model). We know that **"Once the team of threads completes the parallel region, they `synchronise` and return to the pool, leaving only the master thread, which executes sequentially."**. However, there are a few important aspects that we have left out:
 - The time taken to finish the assigned task is **different** for each thread.
 - How can OpenMP know/identify **when** a thread has completed its own task.
 - How can OpenMP know/identify **when** all threads have finished all the tasks.
 
@@ -202,7 +209,7 @@ The answer lies in something called **Barrier Synchronisation**. Here are illust
 
 ### Implicit Barriers
 
-The barrier synchronisation implicitly (behind the scene) occur at the end of constructs such as parallel construct ("`#pragma omp parallel`") and the end of worksharing constructs(loop, sections, single, and workshare constructs).
+Barrier synchronisation implicitly (behind the scenes) occurs at the end of constructs (regions of code) such as the parallel construct (`#pragma omp parallel`) and at the end of worksharing constructs (loop, sections, single, and workshare constructs).
 
 ```c
 #include 
 
@@ -254,4 +261,4 @@ We don't need to know exactly how OpenMP implemented this feature, at least not
 - We also need a mechanism to make a finished thread idle and **wait()** for other threads to finish.
 - The last thread to finish has the responsibility of **notify()** other threads (threads that you want to be executed after the barrier).
 
-Voila! we have a barrier. We will implement barrier as part of a mini-project using [Posix Thread](https://docs.oracle.com/cd/E26502_01/html/E35303/tlib-1.html).
+Voila! We have a barrier.
\ No newline at end of file
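+
+For completeness, here is a minimal sketch (illustrative only; the two "phases" are made up for this example) of an *explicit* barrier in OpenMP, where every thread must reach the `#pragma omp barrier` line before any thread is allowed to continue:
+
+```c
+#include <omp.h>
+#include <stdio.h>
+
+int main(void) {
+    #pragma omp parallel
+    {
+        int id = omp_get_thread_num();
+
+        printf("Thread %d finished phase 1\n", id);
+
+        // Explicit barrier: wait here until every thread has finished phase 1
+        #pragma omp barrier
+
+        printf("Thread %d started phase 2\n", id);
+    }
+    return 0;
+}
+```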