Commit

revision
ssvassiliev committed Jun 11, 2024
1 parent 205a94f commit ee6e593
Showing 7 changed files with 62 additions and 124 deletions.
8 changes: 4 additions & 4 deletions _episodes/01-intro_to_openMP.md
@@ -201,17 +201,17 @@ free(A); // Free memory
{: .callout}

>## Training cluster
>The default environment on the training cluster is StdEnv/2020. If you are using the training cluster run the following command:
>The default environment on the training cluster is StdEnv/2020. We will use gcc/9.3.0:
>~~~
>module load StdEnv/2023
>module load StdEnv/2020 gcc/9.3.0
>~~~
>{: .language-bash}
> - Intel compilers are not available on the training cluster.
{: .callout}

The following is an example of a simple hello world program written in C.
~~~
/* --- File hello_world.c --- */
/* --- File hello_serial.c --- */
#include <stdio.h>
#include <stdlib.h>
@@ -223,7 +223,7 @@ int main(int argc, char **argv) {

In order to compile this code, you would need to use the following command:
~~~
gcc -o hello hello_world.c
gcc -o hello hello_serial.c
~~~
{: .language-bash}
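The full listing of hello_serial.c is collapsed in this diff view; a minimal sketch of what it might contain (the body below is an assumption, not the lesson's exact file) is:
~~~
/* --- File hello_serial.c --- */
/* A plain serial hello-world; no OpenMP is involved yet. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
  printf("Hello, world!\n");
  return EXIT_SUCCESS;
}
~~~
{:.language-c}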

4 changes: 0 additions & 4 deletions _episodes/02-hello_world.md
@@ -151,8 +151,6 @@ Since threads are a programming abstraction, there is no direct relationship bet
- The function `omp_get_thread_num()` returns the ID of the calling thread.
{: .self_study_text}

<div class="gitfile" markdown="1">

~~~
...
#pragma omp parallel
@@ -165,8 +163,6 @@ Since threads are a programming abstraction, there is no direct relationship bet
~~~
{: .language-c}

hello_omp.c
</div>

- Another useful function is `omp_get_num_threads()`, which returns the number of threads.
{: .self_study_text}
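A minimal, self-contained sketch showing both functions together (this is an illustration, not the lesson's hello_omp.c):
~~~
#include <stdio.h>
#include <omp.h>

int main(void) {
  #pragma omp parallel
  {
    int id = omp_get_thread_num();        /* ID of the calling thread */
    int nthreads = omp_get_num_threads(); /* threads in this parallel region */
    printf("Thread %d of %d says hello\n", id, nthreads);
  }
  return 0;
}
~~~
{:.language-c}

Compiled with `gcc -fopenmp`, each thread prints its own ID and the total thread count.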
16 changes: 9 additions & 7 deletions _episodes/04-array_operations.md
@@ -54,7 +54,7 @@ int main(int argc, char **argv)
}
~~~
{: .language-c}
[array_multiply_template.c](https://github.com/ssvassiliev/ACENET_Summer_School_OpenMP_2023/raw/gh-pages/code/array_multiply_template.c)
[array_multiply_serial.c](https://github.com/ssvassiliev/ACENET_Summer_School_OpenMP_2023/raw/gh-pages/code/array_multiply_serial.c)
</div>

- The `omp_get_wtime()` function is used to determine the start and end times for the loop.
@@ -63,7 +63,7 @@ int main(int argc, char **argv)
### Compiling and Running a Serial Version
Compile the program.
~~~
gcc array_multiply_template.c -o array_multiply_serial -fopenmp
gcc array_multiply_serial.c -o array_multiply_serial -fopenmp
~~~
{:.language-bash}
Run it on the cluster.
@@ -329,8 +329,7 @@ For more info, see [Thread safety](https://en.wikipedia.org/wiki/Thread_safety).
#### CPU Cache and Data Locality
Let's do a quick experiment. Compile our matrix_multiply_omp.c code with GCC:
~~~
module load intel/2022.1.0
icc matrix_multiply_omp.c -gopenmp
gcc matrix_multiply_omp.c -fopenmp -O1
~~~
{:.language-bash}

@@ -341,7 +340,7 @@ srun --mem-per-cpu=2000 -c4 ./a.out
~~~
{:.language-bash}
~~~
Total is 100000000, time is 0.021937 s
Total is 100000000, time is 0.014977 s
~~~
{:.output}

@@ -359,7 +358,7 @@ Then swap *i* and *j* indexes in the main for loop:
Recompile the program and run it again.

~~~
Total is 100000000, time is 0.676217 s
Total is 100000000, time is 0.391304 s
~~~
{:.output}
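For reference, the two loop orderings being compared might look like the sketch below (the array name, its size, and the 1-D allocation are assumptions; the lesson's matrix_multiply_omp.c may differ):
~~~
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 10000

int main(void) {
  double *A = malloc((size_t)N * N * sizeof(double));
  double total = 0.0;
  int i, j;

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      A[(size_t)i * N + j] = 1.0;

  double t = omp_get_wtime();
  /* Fast order: the inner loop (j) walks along a row, which is contiguous
     in C's row-major layout.  Swapping the loops (outer j, inner i) makes
     the inner loop stride by N doubles and defeats the cache, which is
     what the slowdown above demonstrates. */
  #pragma omp parallel for private(j) reduction(+: total)
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      total += A[(size_t)i * N + j];

  printf("Total is %.0f, time is %f s\n", total, omp_get_wtime() - t);
  free(A);
  return 0;
}
~~~
{:.language-c}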

@@ -374,7 +373,7 @@ If inner loop iterates through elements of a row then chunk of a row is loaded i
{:.self_study_text}

~~~
inner loop
inner loop j (rows)
j=1 j=2
1 2 ... 1000 ... --> 1 2 ... 1000
1001 1002 ... 2000 ... --> 1001 1002 ... 2000
@@ -384,6 +383,7 @@ If inner loop iterates through elements of a row then chunk of a row is loaded i

This will not happen if the inner loop iterates through the elements of a column (loop variable *i*), because columns are not contiguous memory blocks:
~~~
inner loop i (columns)
i=1 i=2
1 2 ... 1000 ... --> 1 1001 2001 ... 2 1002 2002 ...
1001 1002 ... 2000 ...
@@ -406,3 +406,5 @@ for(i = 0; i < N; i++){

#### Minimizing Parallelization Overhead
If the inner loop is parallelized, a parallel region is created in each iteration of the outer loop, which causes parallelization overhead.

- It is better to create threads less often and give each thread longer tasks, as sketched below.
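
A sketch of the difference (the array name and sizes are assumptions): parallelizing the outer loop creates one parallel region for the whole nest, whereas parallelizing the inner loop creates a new one on every outer iteration.
~~~
#include <stdio.h>
#include <omp.h>

#define N 1000
#define M 1000

static double a[N][M];

int main(void) {
  int i, j;

  /* Better: the parallel region is created once for the whole nest. */
  #pragma omp parallel for private(j)
  for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
      a[i][j] = i + j;

  /* Worse: the thread team is spun up N times, once per outer iteration:
   *
   *   for (i = 0; i < N; i++)
   *     #pragma omp parallel for
   *     for (j = 0; j < M; j++)
   *       a[i][j] = i + j;
   */

  printf("a[%d][%d] = %.1f\n", N - 1, M - 1, a[N - 1][M - 1]);
  return 0;
}
~~~
{:.language-c}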
82 changes: 32 additions & 50 deletions _episodes/05-calculating-areas.md
@@ -14,20 +14,14 @@ keypoints:
- "The best option to parallelize summation is to use the *reduction* directive"
---
## Introduction
A data race occurs when two threads access the same memory without proper synchronization. This can cause the program to produce incorrect results in parallel mode.
- A data race occurs when two threads access the same memory without proper synchronization.
- A data race can cause a program to produce incorrect results in parallel mode.

As we have learned, loops are parallelized by assigning different loop iterations to different threads. Because loop iterations can run in any order, when two threads write to a shared variable in a parallel region, the final value of the variable depends on which iteration writes last. In sequential mode, this is always the last iteration, but in parallel mode, this is not guaranteed.
{:.instructor_notes}

In this section, we will use two example problems, parallel numerical integration and finding the maximum in an array, to look at how to control access to global variables.

## Parallel Numerical Integration
As our example, let's integrate the sine function from 0 to $\pi$.
Here is an example of integrating the sine function from 0 to $\pi$.

This is the same as the area under the first half of a sine curve. To compute approximation of an integral we will use the simplest Rectangle Method. We will partition area under the curve into a number of very narrow rectangles and add areas of these small shapes together.
{:.instructor_notes}

The single-threaded version is:
The serial version is as follows:

<div class="gitfile" markdown="1">

@@ -50,9 +44,11 @@ int main(int argc, char **argv) {
}
~~~
{:.language-c}
[integrate_sin_template.c](https://github.com/ssvassiliev/ACENET_Summer_School_OpenMP_2023/raw/gh-pages/code/integrate_sin_template.c)
[integrate_sin_serial.c](https://github.com/ssvassiliev/ACENET_Summer_School_OpenMP_2023/raw/gh-pages/code/integrate_sin_serial.c)
</div>

Make a copy of the file integrate_sin_serial.c to work on for parallelization; you can call it integrate_sin_omp.c.

### Compiling Code with Mathematical Functions
To compile a C program that uses the math library with GCC, we need to link to it explicitly:
~~~
@@ -70,9 +66,6 @@ Using 1e+07 steps
The integral of sine from 0 to Pi is 2.000000000000
~~~
{:.output}

The result in this case should be 2.0, and with 1e7 steps our program computed it accurately. To see what happens to the time this program takes, we'll use a new tool. Since we just want to see the total time, we can use the command *time*:
{:.instructor_notes}

- Use the `time` utility to get the execution time:
{:.self_study_text}
@@ -97,7 +90,7 @@ The *user* and *sys* lines describe how much time was spent in the "user" code a
{:.instructor_notes}

## Parallelizing Numerical Integration
The program spends most of its time computing areas of small rectangles and adding them together. Let's parallelize the main loop and execute the code.
Let's parallelize the main loop and execute the code.

<div class="gitfile" markdown="1">

@@ -111,26 +104,18 @@ The program spends most of its time computing areas of small rectangles and addi
integrate_sin_omp.c
</div>

- The data dependency on *total* leads to a race condition.
- The result is incorrect because the data dependency on *total* leads to a race condition when the program is run in parallel.
{:.self_study_text}

The program works, but the result is incorrect when we use more than one thread. What is the problem? The data dependency on *total* leads to a race condition. Since we are updating a global variable, there is a race between the various threads as to who can read and then write the value of *total*. Multiple threads could read the current value, before a working thread can write the result of its addition. So these reading threads essentially miss out on some additions to the total.
{:.instructor_notes}

### How to Avoid Data Race Conditions?
One strategy to avoid data race is to synchronize threads to ensure that the variable is accessed in the right order. OpenMP provides a number of ways to ensure that threads are executed in the right order.
{:.instructor_notes}
- Data races can be avoided by synchronizing threads so that variables are accessed in the correct order.

#### The *omp critical* Directive
Race conditions can be avoided by adding a *critical* directive.

- A *critical* directive only allows one thread at a time to run some block of code.

The *critical* directive is a very general construct; it can be applied to any arbitrary code block and ensures that the block is executed by only one thread at a time, by locking it for other threads while one thread runs it. The drawback of this directive is poor performance: the first thread that acquires the lock makes progress, while the others sit and wait in line until it has finished. In addition, significant overhead is added every time a thread enters and exits the critical section, on top of the inherent cost of serialization.
{:.instructor_notes}


Let's add the *omp critical* directive to the statement in the main loop and rerun the code.
Add the *omp critical* directive to the statement in the main loop and rerun the code.

<div class="gitfile" markdown="1">

@@ -153,20 +138,18 @@ srun -c2 time -p ./integrate
~~~
{:.language-bash}

The addition of the *critical* directive slows down the program relative to the serial version. And if we run it with more threads, it slows down even more.
{:.instructor_notes}

- Using *critical* in this case is the wrong choice because it serializes the whole loop.
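
For reference, a sketch of what the *critical* version of the integration loop might look like (the variable names are assumed from the serial code above, not copied from the lesson's integrate_sin_omp.c):
~~~
/* --- integrate_critical.c (sketch) --- */
#include <stdio.h>
#include <math.h>
#include <omp.h>

int main(void) {
  long nsteps = 10000000;          /* 1e7 rectangles */
  double delta = M_PI / nsteps;    /* width of each rectangle */
  double total = 0.0;
  long i;

  #pragma omp parallel for
  for (i = 0; i < nsteps; i++) {
    double x = (i + 0.5) * delta;  /* midpoint of rectangle i */
    #pragma omp critical
    total += sin(x) * delta;       /* one thread at a time: correct but slow */
  }

  printf("The integral of sine from 0 to Pi is %.12f\n", total);
  return 0;
}
~~~
{:.language-c}

Compile with `gcc -fopenmp -lm`, as with the serial version.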

#### The *omp atomic* Directive

- The *omp atomic* directive is similar to *critical* but one thread being in an atomic operation doesn't block any other atomic operations about to happen.
{:.self_study_text}

Another way to avoid race conditions is to use the *omp atomic* directive. The *omp atomic* directive is similar to *critical*, but one thread being in an atomic operation doesn't block any other atomic operations about to happen. Where available, *atomic* takes advantage of CPU instructions that provide atomic operations. Depending on the CPU architecture, some instructions (such as read-modify-write, fetch-and-add, and compare-and-swap) are atomic. These instructions perform multiple things in memory in a single, atomic step that cannot be interrupted. In that case there is no lock/unlock needed on entering/exiting the line of code; the atomic operation is simply performed, and the hardware (or OS) ensures that it is not interfered with. Another advantage of the *omp atomic* directive is much lower overhead.
{:.instructor_notes}

- The downsides are that it can be used only to control a single statement and the set of operations that atomic supports is restricted. Of course, with both *omp critical* and *omp atomic*, you incur the cost of serialization.
- The downsides:
- it can be used only to control a single statement
- the set of operations that atomic supports is restricted
- there are serialization costs associated with both *omp critical* and *omp atomic*.

Examples demonstrating how to use atomic:
~~~
@@ -184,7 +167,7 @@ Examples demonstrating how to use atomic:
~~~
{:.language-c}

The *atomic* clauses: update (default), write, read, capture
The *atomic* clauses: update (default), write, read, capture

> ## Parallel Performance
> - Replace the *critical* directive with the *atomic* directive, recompile the code, and run it on more than one thread. Try different numbers of threads (1-8) and record the execution time.
@@ -205,18 +188,13 @@ The *atomic* clauses: update (default), write, read, capture
{: .challenge}

### The Optimal Way to Parallelize Integration with OpenMP
The best option to parallelize summation is to let each thread to operate on its own chunk of data and when all threads finish add up their sums together.
The best option is to let each thread work on its own chunk of data and then add the partial sums of all threads together.
The *reduction* clause lets you specify thread-private variables that are subject to a reduction operation at the end of the parallel region.

OpenMP provides a specific thread-safe mechanism to do this: the *reduction* clause.
{:.instructor_notes}

- The *reduction* clause lets you specify thread-private variables that are subject to a reduction operation at the end of the parallel region.

As we are doing summation, the reduction operation we are looking for is "+". At the end of the reduction, the values of all private copies of the shared variable will be added together, and the final result will be written to the global shared variable.
{:.instructor_notes}

Let's comment out both the *critical* and the *atomic* directives and add the *reduction* variable *total* subjected to the reductions operator "+" to the parallel *for* loop:
{:.instructor_notes}
- the reduction operation for summation is "+"
- at the end of the parallel construct, the values of all private copies of the shared variable will be added together

~~~
#pragma omp parallel for reduction(+: total)
@@ -225,15 +203,15 @@

Recompile the code and execute it. Now we get the right answer and a 3.7x speedup with 4 threads!
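
Putting it together, the reduction version of the loop might look like this fragment (it assumes the same variable names as the *critical* sketch earlier and replaces that loop):
~~~
  total = 0.0;

  #pragma omp parallel for reduction(+: total)
  for (i = 0; i < nsteps; i++) {
    double x = (i + 0.5) * delta;
    total += sin(x) * delta;   /* each thread adds into its private copy */
  }
  /* After the loop, the private copies are combined into the shared total. */
~~~
{:.language-c}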

In addition to summation OpenMP supports several other reduction operations, such as multiplication, minimum, maximum, logical operators. Next, we will look at other uses of reduction variables.

- Other reduction operations:
- subtraction
- multiplication
- minimum, maximum
- logical operators.

> ## Finding the Maximum Value in an Array
>Let's say that we need to search through an array to find the largest value.
>How could we do this type of search in parallel?
>Imagine that you are trying to find the largest value in an array.
>Is there a way to do this type of search in parallel?
>Begin with the serial version:
>~~~
>/* --- File array_max.c --- */
>/* --- File array_max_serial.c --- */
>#include <stdio.h>
>#include <stdlib.h>
>#include <omp.h>
@@ -264,7 +245,9 @@
>{:.language-c}
>
>> ## Solution
>> This problem is analogous to the summation. You would want to make sure that each thread has a private copy of the *curr_max* variable, since it will be written to. When all threads have found the maximum value in their share of data you would want to find out which thread has the largest value.
>> This problem is analogous to the summation.
>> - Because the *curr_max* variable will be written to, each thread should have its own copy.
>> - Once all threads have found the maximum value in their share of data, you need to determine which thread has the largest value.
>> ~~~
>> ...
>> curr_max = 0.0;
Expand All @@ -274,4 +257,3 @@ In addition to summation OpenMP supports several other reduction operations, suc
>> {:.language-c}
> {: .solution}
{: .challenge}
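
With OpenMP 3.1 or later, the same idea can be written compactly with a `max` reduction; a sketch, assuming the array is called *A* and has *N* elements as in the serial version:
~~~
  curr_max = 0.0;

  #pragma omp parallel for reduction(max: curr_max)
  for (i = 0; i < N; i++) {
    if (A[i] > curr_max)
      curr_max = A[i];   /* each thread tracks its own maximum */
  }
  /* The largest of the per-thread maxima is combined back into curr_max. */
~~~
{:.language-c}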
