
Commit

Merge pull request #95 from MonashDeepNeuron/dev
Dev
linton2000 authored May 2, 2024
2 parents b5fec9b + 5bc81b6 commit 954f7ce
Showing 7 changed files with 50 additions and 44 deletions.
2 changes: 1 addition & 1 deletion src/intro-to-parallel-comp/challenges.md
@@ -4,7 +4,7 @@

Make sure to clone a copy of **your** challenges repo onto M3, ideally in a personal folder on vf38_scratch.

> Note: For every challenge you will be running the programs as SLURM jobs. This is so we don't overload the login nodes. A template [SLURM job script](./job.slurm) is provided at the root of this directory, which you can use to submit your own jobs to SLURM by copying it into each challenge's sub-directory and filling in the missing details. You may need more than one for some challenges. This template will put the output that would normally be printed to the terminal into a file named `slurm-<job-name>.out`.
> Note: For every challenge you will be running the programs as SLURM jobs. This is so we don't overload the login nodes. A template [SLURM job script](https://github.com/MonashDeepNeuron/HPC-Training-Challenges/blob/main/challenges/distributed-computing/job.slurm) is provided at the root of this directory, which you can use to submit your own jobs to SLURM by copying it into each challenge's sub-directory and filling in the missing details. You may need more than one for some challenges. This template will put the output that would normally be printed to the terminal into a file named `slurm-<job-name>.out`.
## Task 1 - Single Cluster Job using OpenMP

2 changes: 1 addition & 1 deletion src/intro-to-parallel-comp/intro-to-parallel-comp.md
@@ -4,6 +4,6 @@ As introduced in chapter 5, parallel computing is all about running instructions

![query-parallelism](./imgs/query-parallelism.png)

In this context, you can consider a query to be a job that carries out a series of steps on a particular dataset in order to achieve something e.g. a SORT query on a table. It's fairly straightforward to execute multiple queries at the same time using a parallel/distributed system but what if we want to parallelise and speed up the individual operations within a query?
In this context, you can consider a query to be a job that carries out a series of steps on a particular input in order to achieve something, e.g. a SORT query on a table. It's fairly straightforward to execute multiple queries at the same time using a parallel/distributed system, but what if we want to parallelise and speed up the individual operations within a query?

This is where things like synchronisation, data/workload distribution and aggregation need to be considered. In this chapter we will provide some theoretical context before learning how to implement parallelism using OpenMP & MPI.
4 changes: 2 additions & 2 deletions src/intro-to-parallel-comp/locks.md
@@ -1,13 +1,13 @@
# Locks

Earlier, we learnt how to write concurrent programs, as well as a few constructs to achieve **synchronisation** in OpenMP. We know that:
- `reduction construct` partitions shared data and uses barrier to achieve synchronisation
- `reduction construct` partitions shared data and uses a barrier to achieve synchronisation
- `atomic construct` utilises hardware capabilities to achieve thread-safe small memory read/write operations (see the sketch below).
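
As a quick illustrative sketch (the variable names and loop bounds here are arbitrary), the two constructs can be used like this:

```c
#include <stdio.h>

int main(void) {
    int sum = 0;    // combined across threads by the reduction clause
    int hits = 0;   // shared counter updated with the atomic construct

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++) {
        sum += i;              // each thread accumulates into a private copy;
                               // the copies are merged at the implicit barrier
        if (i % 3 == 0) {
            #pragma omp atomic
            hits++;            // a single thread-safe read-modify-write
        }
    }

    printf("sum = %d, hits = %d\n", sum, hits);
    return 0;
}
```

Compiled with `gcc -fopenmp`, this produces the same result regardless of the number of threads, because both updates are synchronised.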

What about `critical construct`? We said that it uses locks, but what are locks?

> Note that the direct use of locks is **not recommended** (at least in OpenMP):
> - It is very easy to cause deadlock or hard-to-debug livelock (more on these at the end of this sub-chapter).
> - It is very easy to cause a deadlock or hard-to-debug livelock (more on these at the end of this sub-chapter).
> - It can often cause very poor performance or worse.
> - It generally indicates that the program design is wrong.
>
17 changes: 11 additions & 6 deletions src/intro-to-parallel-comp/message-passing.md
@@ -1,8 +1,8 @@
# Message Passing

As each processor has its own local memory with its own address space in distributed computing, we need a way to communicate between the processes and share data. Message passing is the mechanism of exchanging data across processes. Each process can communicate with one or more other processes by sending messages over a network.
As each processor has its own local memory with its own address space in distributed computing, we need a way to implement communication between the distributed processes and allow data sharing. Message passing is the mechanism of exchanging data between processes. Each process can communicate with one or more other processes by sending messages over a network.

The MPI (message passing interface) in OpenMPI is a communication protocol standard defining message passing between processors in distributed environments and are implemented by different groups with the main goals being high performance, scalability, and portability.
MPI (Message Passing Interface) is a communication protocol standard defining message passing between processors in distributed environments. The main goals of this standard are high performance, scalability, and portability.

OpenMPI is one implementation of the MPI standard. It consists of a set of headers and library functions that you call from your program (in C, C++, Fortran, etc.).

@@ -125,9 +125,13 @@ int MPI_Finalize(void);

```
Use man pages to find out more about each routine
Terminology:
- **World Size**: The total no. of processes involved in your distributed computing job.
- **Rank**: A unique ID for a particular process.
When sending a Process it packs up all of its necessary data into a buffer for the receiving process. These buffers are often referred to as envelopes since the data is being packed into a single message before transmission (similar to how letters are packed into envelopes before transmission to the post office)
> Use OpenMPI man pages to find out more about each routine
When a process sends data, it packs up all of the necessary data into a buffer for the receiving process. These buffers are often referred to as envelopes, since the data is packed into a single message before transmission (similar to how letters are packed into envelopes before being taken to the post office).
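
As an illustrative sketch of these terms (the ranks, tag and payload below are arbitrary), a job launched with at least two processes can exchange a single integer like this:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                       // set up the MPI environment

    int world_size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);   // total no. of processes
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);         // this process's unique ID

    if (rank == 0) {
        int payload = 42;
        // Pack the data into a message (the "envelope") and send it to rank 1
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("Rank 0 of %d sent %d\n", world_size, payload);
    } else if (rank == 1) {
        int received;
        MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", received);
    }

    MPI_Finalize();
    return 0;
}
```

Such a program would typically be compiled with `mpicc` and launched with `mpirun -np 2` (or via `srun`/a SLURM job script on M3).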
### Elementary MPI Data types
@@ -257,8 +261,9 @@ The command top or htop looks into a process. As you can see from the image below
- The command `time` checks the overall performance of the code
- By running this command, you get real time, user time and system time.
- Real is wall clock time - time from start to finish of the call. This includes overhead time
- User is the amount of CPU time spent outside the kernel within the process
- Sys is the amount of CPU time spent in the kernel within the process.
- User is the amount of CPU time spent outside the kernel within the process
- Sys is the amount of CPU time spent in the kernel within the process.
- User time + Sys time will tell you how much actual CPU time your process used.


![time](imgs/time.png)
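
The same real-vs-CPU distinction can also be observed from inside a program. Here is an illustrative sketch (the loop size is arbitrary) that compares wall-clock time from `omp_get_wtime()` with process CPU time from `clock()`:

```c
#include <stdio.h>
#include <time.h>
#include <omp.h>

int main(void) {
    double wall_start = omp_get_wtime();   // wall-clock ("real") time
    clock_t cpu_start = clock();           // CPU time used by this process

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 100000000; i++) {
        sum += (double)i;
    }

    double wall = omp_get_wtime() - wall_start;
    double cpu  = (double)(clock() - cpu_start) / CLOCKS_PER_SEC;

    // On Linux, clock() counts CPU time across all threads of the process,
    // so with several threads cpu typically exceeds wall, just as
    // user + sys can exceed real in the output of `time`.
    printf("sum = %f, wall = %.3f s, cpu = %.3f s\n", sum, wall, cpu);
    return 0;
}
```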
36 changes: 15 additions & 21 deletions src/intro-to-parallel-comp/multithreading.md
@@ -1,32 +1,32 @@
# Multithreading

We have all looked at the theory of threads and concurrent programming in the Operating System chapter. Now, we will shift our focus to OpenMP and its application for executing multithreaded operations in a declarative programming style.
Hopefully by now you are all familiar with multi-threading and how parallel computing works. We'll now go through how to implement parallel computing using OpenMP in order to speed up the execution of our C programs.

## OpenMP

OpenMP is an Application Program Interface (API) that is used to explicitly direct multi-threaded, shared memory parallelism in C/C++ programs. It is not intrusive on the original serial code in that the OpenMP instructions are made in pragmas interpreted by the compiler.

> Further features of OpenMP will be introduced in conjunction with the concepts discussed in later sub-chapters.
OpenMP is an Application Program Interface (API) that is used to implement multi-threaded, shared memory parallelism in C/C++ programs. It's designed to be a very minimal add-on to serial C code when it comes to implementation. All you have to do is use the `#pragma` mechanism (a C preprocessor directive) to wrap the parallel regions of your code.
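
For example, here is a minimal illustrative sketch of a parallel region wrapped in a pragma:

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    // The block below is a parallel region: the master thread forks a team
    // and every thread in the team executes the block.
    #pragma omp parallel
    {
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    // Only the master thread continues past this point.
    return 0;
}
```

Compiled with `gcc -fopenmp` and run with, say, `OMP_NUM_THREADS=4`, it would typically print one line per thread, in a non-deterministic order.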

### Fork-Join Parallel Execution Model

OpenMP uses the `fork-join model` of parallel execution.
OpenMP uses the *fork-join model* of parallel execution.

* **FORK**: All OpenMP programs begin with a `single master thread` which executes sequentially until a `parallel region` is encountered, when it creates a team of parallel threads.
* **FORK**: All OpenMP programs begin with a *single master thread* which executes sequentially until a *parallel region* is encountered. After that, it spawns a *team of threads* to carry out the multi-threaded parallel computing.

The OpenMP runtime library maintains a pool of threads that can be added to the threads team in parallel regions. When a thread encounters a parallel construct and needs to create a team of more than one thread, the thread will check the pool and grab idle threads from the pool, making them part of the team.
The OpenMP runtime library maintains a pool of potential OS threads that can be added to the thread team during parallel region execution. When a thread encounters a parallel construct (pragma directive) and needs to create a team of more than one thread, the thread will check the pool and grab idle threads from the pool, making them part of the team.

* **JOIN**: Once the team threads complete the parallel region, they `synchronise` and return to the pool, leaving only the master thread that executes sequentially.
This speeds up thread spawning by using a *warm start* mechanism, minimising the kernel scheduler context-switching overhead that spawning threads from scratch would incur.

![Fork - Join Model](./imgs/fork-join.png)
> If you're unclear on how kernel scheduler context switching works, revisit the operating systems chapter and look up the topics introduced there.
* **JOIN**: Once the team of threads complete the parallel region, they **synchronise** and return to the pool, leaving only the master thread that executes sequentially.

> We will look a bit more into what is synchronisation as well as synchronisation techniques in the next sub-chapter.
![Fork - Join Model](./imgs/fork-join.png)

### Imperative vs Declarative

Imperative programming specifies and directs the control flow of the program. On the other hand, declarative programming specifies the expected result and core logic without directing the program's control flow.
Imperative programming specifies and directs the control flow of the program. On the other hand, declarative programming specifies the expected result and core logic without directing the program's control flow i.e. you tell the computer what to do instead of *how to do it*.

OpenMP follows a declarative programming style. Instead of manually creating, managing, synchronizing, and terminating threads, we can achieve the desired outcome by simply declaring it using pragma.
OpenMP follows a declarative programming style. Instead of manually creating, managing, synchronizing, and terminating threads, we can achieve the desired outcome by simply declaring pragma directives in our code.

![Structure Overview](./imgs/program-structure.png)

@@ -52,24 +52,22 @@ int main() {

## Running on M3

Here is a template script provided in the home directory in M3. Notice that we can dynamically change the number of threads using `export OMP_NUM_THREADS=12`
Here is a template script provided in the home directory in M3. Notice that we can dynamically change the number of threads using `export OMP_NUM_THREADS=12`.

> The `export` statement is a bash command you can type into a WSL/Linux terminal. It allows you to set environment variables in order to manage runtime configuration.
```bash
#!/bin/bash
# Usage: sbatch slurm-openmp-job-script
# Prepared By: Kai Xi, Apr 2015
# [email protected]

# NOTE: To activate a SLURM option, remove the whitespace between the '#' and 'SBATCH'

# To give your job a name, replace "MyJob" with an appropriate name
# SBATCH --job-name=MyJob


# To set a project account for credit charging,
# SBATCH --account=pmosp


# Request CPU resource for an OpenMP job, suppose it is a 12-thread job
# SBATCH --ntasks=1
# SBATCH --ntasks-per-node=1
@@ -81,24 +79,20 @@ Here is a template script provided in the home directory in M3. Notice that we c
# Set your minimum acceptable walltime, format: day-hours:minutes:seconds
# SBATCH --time=0-06:00:00


# To receive an email when job completes or fails
# SBATCH --mail-user=<Your Email Address>
# SBATCH --mail-type=END
# SBATCH --mail-type=FAIL


# Set the file for output (stdout)
# SBATCH --output=MyJob-%j.out

# Set the file for error log (stderr)
# SBATCH --error=MyJob-%j.err


# Use reserved node to run job when a node reservation is made for you already
# SBATCH --reservation=reservation_name


# Command to run an OpenMP job
# Set OMP_NUM_THREADS to the same value as: --cpus-per-task=12
export OMP_NUM_THREADS=12
2 changes: 1 addition & 1 deletion src/intro-to-parallel-comp/parallel-algos.md
@@ -37,4 +37,4 @@ This is how one of the simplest parallel algorithms - **parallel sum** works. Al

![parallel-sum](./imgs/parallel-sum-diagram.png)

Besides the difference between serial & parallel regions another important concept to note here is **partitioning** aka. chunking. Often when you're parallelising your serial algorithm you will have to define local, parallel tasks that will execute on different parts of your dataset simultaneously in order to acheive a speedup. This can be anything from a sum operation in this case, to a local/serial sort or even as complex as the training of a CNN model on a particular batch of images.
Besides the difference between serial & parallel regions, another important concept to note here is **partitioning**, aka chunking. Often when you're parallelising your serial algorithm you will have to define local, parallel tasks that will execute on different parts of your input simultaneously in order to achieve a speedup. This can be anything from a sum operation in this case, to a local/serial sort, or even as complex as the training of a CNN model on a particular batch of images.
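
As an illustrative sketch of this kind of partitioning (using OpenMP, which is introduced later in this chapter; the array contents and chunk arithmetic are arbitrary), each thread below sums its own contiguous chunk serially, and the partial results are then aggregated:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double data[N];
    for (int i = 0; i < N; i++) data[i] = 1.0;   // toy input: the total should be N

    double total = 0.0;

    #pragma omp parallel
    {
        int nthreads = omp_get_num_threads();
        int tid = omp_get_thread_num();

        // Partition the input into contiguous chunks, one per thread
        int chunk = (N + nthreads - 1) / nthreads;
        int start = tid * chunk;
        int end = (start + chunk < N) ? start + chunk : N;

        // Each thread computes a local (serial) sum over its own chunk...
        double local = 0.0;
        for (int i = start; i < end; i++) local += data[i];

        // ...and the partial results are aggregated into the shared total
        #pragma omp atomic
        total += local;
    }

    printf("total = %f (expected %d)\n", total, N);
    return 0;
}
```

The explicit `start`/`end` arithmetic is roughly what a work-sharing construct like `#pragma omp parallel for` would otherwise do for you; it is written out here to make the chunking visible.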