Merge pull request #93 from MonashDeepNeuron/dev
Dev
linton2000 committed Apr 26, 2024
2 parents 77317f3 + f3b5f6d commit b5fec9b
Showing 13 changed files with 83 additions and 62 deletions.
13 changes: 2 additions & 11 deletions src/SUMMARY.md
@@ -41,8 +41,7 @@

- [M3 & SLURM](./m3-slurm/m3-slurm.md)

- [Batch Processing vs. Cloud Computing](./m3-slurm/batch-cloud.md)
- [Parallel & Distributed Computing](./m3-slurm/parallel-distributed.md)
- [What is HPC really?](./m3-slurm/hpc-intro.md)
- [M3 Login - SSH & Strudel](./m3-slurm/login.md)
- [Intro to SLURM](./m3-slurm/slurm_intro.md)
- [M3 Interface & Usage](./m3-slurm/m3-interface.md)
@@ -51,21 +50,13 @@

- [Intro to Parallel Computing](./intro-to-parallel-comp/intro-to-parallel-comp.md)

- [Parallel Algorithms](./intro-to-parallel-comp/parallel-algos.md)
- [OpenMP: Multithreading](./intro-to-parallel-comp/multithreading.md)
- [Synchronisation Issues](./intro-to-parallel-comp/synchronisation.md)
- [Dead & Live Locks](./intro-to-parallel-comp/locks.md)
- [MPI: Message Passing](./intro-to-parallel-comp/message-passing.md)
- [Challenges](./intro-to-parallel-comp/challenges.md)

- [Parallelisation of Algorithms](./parallel-algos/parallel-algos.md)

- [Parallel Search](./parallel-algos/parallel-search.md)
- [Parallel Sort](./parallel-algos/parallel-sort.md)
- [Other Parallel Algorithms](./parallel-algos/other-parallel-algos.md)
- [Machine Learning & HPC](./parallel-algos/machine-learning-and-hpc.md)
- [Optimisation Algorithms](./parallel-algos/optim-algos.md)
- [Challenges](./parallel-algos/challenges.md)

- [Apache Spark](./apache-spark/apache-spark.md)
- [Installation & Cluster Set-up](./apache-spark/set-up.md)
- [Internal Architecture](./apache-spark/internals.md)
8 changes: 5 additions & 3 deletions src/intro-to-parallel-comp/intro-to-parallel-comp.md
@@ -1,7 +1,9 @@
# Parallel Computing

In this chapter, we will discuss the abstraction of parallel computing. To facilitate our exploration, we will employ an API within the C Programming Language: OpenMP. This tool will serve as a means to concretely illustrate the underlying language-independent theory.
As introduced in chapter 5, parallel computing is all about running instructions simultaneously on multiple computers rather than doing it all sequentially/serially on the same computer. This is relatively straightforward if we have multiple, completely independent tasks that don't need to share resources or data, i.e. inter-query parallelism.

**Parallel computing is about executing the instructions of the program simultaneously.**
![query-parallelism](./imgs/query-parallelism.png)

One of the core values of computing is the breaking down of a big problem into smaller easier to solve problems, or at least smaller problems. In some cases, the steps required to solve the problem can be executed simultaneously (in parallel) rather than sequentially (in order).
In this context, you can consider a query to be a job that carries out a series of steps on a particular dataset in order to achieve something, e.g. a SORT query on a table. It's fairly straightforward to execute multiple queries at the same time using a parallel/distributed system, but what if we want to parallelise and speed up the individual operations within a query?

This is where things like synchronisation, data/workload distribution and aggregation need to be considered. In this chapter we will provide some theoretical context before learning how to implement parallelism using OpenMP & MPI.
40 changes: 40 additions & 0 deletions src/intro-to-parallel-comp/parallel-algos.md
@@ -0,0 +1,40 @@
# Parallel Algorithms

You can think of all parallel algorithms as having a serial portion and a parallel portion, i.e. local and global phases.

![serial-vs-parallel](./imgs/serial-parallel-parts.png)

> This applies to both local parallel computing between CPU cores with a shared RAM as well as distributed computing between multiple HPC nodes with a distributed memory architecture. The only difference between the two is the additional complexity involved in managing data sharing & sending instructions across a network.

Let's go through this with a simple example. To calculate the sum of all numbers from 1 to N serially, you would execute the following pseudocode:

```
function sumOfNumbers(N):
    result = 0
    for x from 1 to N:
        result += x
    return result
```

To do this in parallel (assuming you have M processors/nodes), you would do something like this:

```
function parallelSum(N):
    globalResult = 0
    partition_size = N // M
    for node from 1 to M:
        start = (node - 1) * partition_size + 1
        end = node * partition_size        // the last node also picks up any remainder up to N
        // sumOfRange is sumOfNumbers restricted to the range [start, end]
        localResult = sendJobToNode(node, sumOfRange(start, end))
        globalResult += localResult
    return globalResult
```

This is how one of the simplest parallel algorithms, **parallel sum**, works. All lines of code besides the `sendJobToNode` function call are executed serially on the master node/thread. This is all illustrated in the diagram below.

![parallel-sum](./imgs/parallel-sum-diagram.png)
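
For a concrete shared-memory flavour of the same idea, here is a minimal OpenMP sketch in C (an illustration, not the book's reference implementation). Each thread sums its own chunk of the loop into a private partial result, and the `reduction` clause combines those partial results at the end, mirroring the parallel/local and serial/global phases above.

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    const long N = 1000000;
    long result = 0;

    // Parallel/local phase: each thread sums its chunk of 1..N into a
    // private copy of `result`. Serial/global phase: OpenMP adds the
    // private copies together once the loop finishes.
    #pragma omp parallel for reduction(+:result)
    for (long x = 1; x <= N; x++) {
        result += x;
    }

    printf("Sum of 1..%ld = %ld\n", N, result);
    return 0;
}
```

Compiled with OpenMP enabled (e.g. `gcc -fopenmp`), the loop is split across however many threads the runtime provides; without it, the pragma is ignored and the code simply runs serially.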

Besides the difference between serial & parallel regions, another important concept to note here is **partitioning**, a.k.a. chunking. Often, when you're parallelising your serial algorithm, you will have to define local, parallel tasks that will execute on different parts of your dataset simultaneously in order to achieve a speedup. This can be anything from a sum operation in this case, to a local/serial sort, or even something as complex as the training of a CNN model on a particular batch of images.
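
As a rough sketch of how such a partitioning is usually computed (the helper below is hypothetical and only for illustration), this splits `N` items across `M` workers and hands the first few workers one extra item each when the division isn't even:

```c
#include <stdio.h>

/* Illustrative helper: compute the half-open range [start, end) of items
 * owned by worker `rank` (0-based) when n items are split across m workers.
 * The first n % m workers each take one extra item so nothing is left over. */
void chunk_bounds(long n, int m, int rank, long *start, long *end) {
    long base  = n / m;
    long extra = n % m;
    *start = rank * base + (rank < extra ? rank : extra);
    *end   = *start + base + (rank < extra ? 1 : 0);
}

int main(void) {
    long start, end;
    // Example: 10 items across 4 workers -> chunk sizes 3, 3, 2, 2.
    for (int rank = 0; rank < 4; rank++) {
        chunk_bounds(10, 4, rank, &start, &end);
        printf("worker %d owns items [%ld, %ld)\n", rank, start, end);
    }
    return 0;
}
```

In a distributed setting each process would typically run this calculation with its own rank number, so every worker can figure out its slice of the data without any communication.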
29 changes: 0 additions & 29 deletions src/m3-slurm/batch-cloud.md

This file was deleted.

2 changes: 1 addition & 1 deletion src/m3-slurm/challenges.md
@@ -38,7 +38,7 @@ Let this run fully. Check the output of the script to make sure it ran correctly
## Challenge 7

Edit your submission script so that you get a gpu node, and run the script using the gpu.
> Hint: Use the m3h partition
> Hint: Use the m3g partition
## Challenge 8

53 changes: 35 additions & 18 deletions src/m3-slurm/parallel-distributed.md → src/m3-slurm/hpc-intro.md
@@ -1,10 +1,42 @@
# Parallel & Distributed Computing
# What is HPC really?

You are all likely familiar with the definition of High Performance Computing. Here is one from IBM:

> High-performance computing (HPC) is technology that uses clusters of powerful processors that work in parallel to process massive multi-dimensional data sets, also known as big data, and solve complex problems at extremely high speeds. HPC solves some of today’s most complex computing problems in real time.

But the term HPC is not really used much outside the scientific research community. A lot of cloud systems involve a similar scale of hardware, parallel & distributed computing, similar computational workloads, data processing capacity and low-latency/high-throughput capability as HPC clusters. *So what exactly is the difference between a cloud system and an HPC cluster?*

At the end of the day this comes down to semantics, but a key difference is that an HPC cluster implies a system primarily used for **batch processing**, whereas a cloud system would involve **interactive processing**.

### Batch Processing vs. Cloud Computing

The vast majority of computer systems, and nearly 100% of the ones the average person uses, are cloud-based interactive systems. Due to the nature of typical research workloads, batch processing is a much more suitable choice for researchers.

__Batch Processing:__
- Jobs (code scripts) submitted are executed at a later time.
- User can't interact (or only limited interaction).
- Performance measure is **throughput**.
- Snapshot of output is used for debugging.

![batch-image](./imgs/batch-processing.jpeg)

__Interactive Processing:__
- Jobs submitted are executed immediately.
- User can interact.
- Performance measure is **response time**.
- Interactive debugging.

![interactive-image](./imgs/interactive-processing.png)

## Parallel & Distributed Computing

Nearly all modern computer systems utilise parallel computing to speed up the execution of algorithms. To see how this works in practice, look at the diagram below.

![parallel vs. distributed](imgs/parallel-distributed.png)

As you can see, in a scenario where a program (job) takes 3 seconds and 3 independent jobs have to be executed by a system, doing it serially in a single computer takes a total of 9 seconds. But doing it simultaneously across 3 computers will only take 3 seconds thus achieving a 3x speedup through parallel computing.
As you can see, in a scenario where a program (job) takes 3 seconds and 3 independent jobs have to be executed by a system, doing it serially on a single processor (computer) takes a total of 9 seconds. But doing it simultaneously across 3 processors will only take 3 seconds, thus achieving a 3x speedup through parallel computing. This parallel computing is performed locally in a **multi-processing** system with more than 1 CPU core (processor).

![multi-processing](imgs/Multiprocessor-System.png)

This is the fundamental principle that High Performance Computing is based on. The trouble (or fun) starts when your tasks have dependencies on each other, which is going to be the case for the vast majority of algorithms. That's when things like synchronisation issues, data sharing and so on come into play, which we'll explore in later chapters.
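
As a toy illustration of that 3x speedup (a hypothetical sketch, not an example from the book), the C program below launches three independent 3-second "jobs" on three OpenMP threads, so the whole run takes roughly 3 seconds instead of 9:

```c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

// Stand-in for an independent job that needs ~3 seconds of work.
void run_job(int id) {
    sleep(3);
    printf("job %d finished\n", id);
}

int main(void) {
    double start = omp_get_wtime();

    // The three jobs share no data, so they can run on three threads at
    // once: total wall time is ~3 s rather than the ~9 s a serial loop takes.
    #pragma omp parallel for num_threads(3)
    for (int id = 0; id < 3; id++) {
        run_job(id);
    }

    printf("elapsed: %.1f seconds\n", omp_get_wtime() - start);
    return 0;
}
```

OpenMP itself is introduced properly in the parallel computing chapter; here it just stands in for "run these jobs on separate processors at the same time".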

@@ -52,19 +84,4 @@ And finally, not everything needs to be done on a parallel or distributed system
### Advantages of serial computing:
- **Simpler** to design & implement algorithms. Parallel algorithms can get quite complex, especially when dealing with more complicated instructions with dependencies.
- **Less overhead** involved in managing a parallel & distributed job. No need to manage data sharing between threads, processes, nodes, etc...
- **No synchronisation issues** & headaches involved in concurrent computing. Don't have to deal with race conditions, deadlocks, livelocks, etc...

## Parallel Scalability
The speed-up achieved from parallelism is dictated by your algorithm. Notably, the serial parts of your algorithm cannot be sped up by increasing the number of processors. The diagram below looks at the benefits we can achieve from writing parallel code as the number of processors increases.

![amdahl](./imgs/parallel_scalability.jpg)

Amdahl's Law, formulated by computer architect Gene Amdahl in 1967, is a principle used to analyze the potential speedup of parallel computing. It states that the speedup of a program from parallelization is limited by the proportion of the program that must be executed serially. In other words, it helps to determine the maximum performance improvement that can be achieved by using parallel processing.
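
Stated as a formula (the standard formulation of the law, added here for reference rather than quoted from the book): if a fraction `p` of a program can be parallelised and the remaining `1 - p` must run serially, the speedup on `N` processors is bounded by

```
S(N) = 1 / ((1 - p) + p / N)  <=  1 / (1 - p)
```

so, for example, with `p = 0.9` the speedup can never exceed 10x, no matter how many processors are added.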

The implications of Amdahl's Law for HPC are very significant:

- **Limitation of Speedup:** Amdahl's Law highlights that even with an increase in the number of processors (parallelization), the overall speedup is limited by the sequential portion of the code. Thus, if a significant portion of the code is inherently serial, the potential speedup achievable through parallelization is restricted.
- **Importance of Identifying Serial Sections:** In HPC, it's crucial to identify the sections of code that are inherently serial and cannot be parallelized. Optimizing these sections can lead to better overall performance. Conversely, focusing solely on parallelizing code without addressing these serial bottlenecks can result in suboptimal speedup.
- **Efficiency vs. Scalability:** Amdahl's Law emphasizes the importance of balancing efficiency and scalability in parallel computing. While increasing the number of processors can improve performance to a certain extent, beyond a certain point, diminishing returns occur due to the overhead of synchronization, communication, and managing parallel tasks.
- **Architectural Considerations:** HPC system architects must consider Amdahl's Law when designing hardware and software architectures. Designing systems that minimize the impact of serial portions of code and optimize parallel execution can lead to better overall performance.
- **Algorithm Selection:** When choosing algorithms for HPC applications, it's essential to consider their parallelizability. Algorithms that can be efficiently parallelized are more suitable for HPC environments, as they can leverage the potential for speedup provided by parallel computing resources more effectively.
- **No synchronisation issues** & headaches involved in concurrent computing. Don't have to deal with race conditions, deadlocks, livelocks, etc...
Binary file added src/m3-slurm/imgs/Multiprocessor-System.png
Binary file added src/m3-slurm/imgs/sinfo.png
Binary file added src/m3-slurm/imgs/squeue.png
Binary file added src/m3-slurm/imgs/srun-sacct.png
