diff --git a/.buildinfo b/.buildinfo
new file mode 100644
index 0000000..4805bc4
--- /dev/null
+++ b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: 740bec8f81608ce9b21967ff23915748
+tags: d77d1c0d9ca2f4c8421862c7c5a0d620
diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 0000000..e69de29
diff --git a/0-setup/index.html b/0-setup/index.html
new file mode 100644
index 0000000..e6c4ae4
--- /dev/null
+++ b/0-setup/index.html
@@ -0,0 +1,232 @@
+
+
+
Google Colaboratory, commonly referred to as “Colab”, is a cloud-based Jupyter notebook environment which runs in your web browser. Using it requires logging in with a Google account.
+
This is how you can get access to NVIDIA GPUs on Colab:
Some exercises in this lesson rely on source code that you should download and modify in your own home directory on the cluster. All code examples are available in the same GitHub repository as this lesson itself. To download it you should use Git:
+
$ git clone https://github.com/ENCCS/gpu-programming.git
+$ cd gpu-programming/content/examples/
+$ ls
+
It states that the number of transistors in a dense integrated circuit doubles about every two years.
+More transistors mean a smaller size of a single element, so a higher core frequency can be achieved.
+However, power consumption scales with frequency to the third power, so the growth in core frequency has slowed down significantly.
+Higher performance of a single node now has to rely on a more complicated structure, and can still be achieved with SIMD (single instruction, multiple data), branch prediction, etc.
+
+
Increasing performance has been sustained with two main strategies over the years:
+
+
+
Increase the single processor performance:
+
More recently, increase the number of physical cores.
The underlying idea of parallel computing is to split a computational problem into smaller
+subtasks. Many subtasks can then be solved simultaneously by multiple processing units.
+
+
How a problem is split into smaller subtasks strongly depends on the problem.
+There are various paradigms and programming approaches to do this.
Graphics processing units (GPUs) have been the most common accelerators during the last few years; the term GPU is sometimes used interchangeably with the term accelerator.
+GPUs were initially developed for the highly parallel task of graphics processing.
+But over the years, they have been used more and more in HPC.
+
GPUs are specialized parallel hardware for floating point operations.
+They are basically co-processors (helpers) for traditional CPUs: the CPU still controls the workflow,
+but it delegates highly parallel tasks to the GPU.
+GPUs are based on highly parallel architectures, which allows them to take advantage of the
+increasing number of transistors.
+
Using GPUs allows one to achieve extreme performance per node.
+As a result, a single GPU-equipped workstation can outperform small CPU-based clusters
+for some types of computational tasks. The drawback is that a major rewrite of the program is usually required,
+with an accompanying change in the programming paradigm.
+
+
Host vs device
+
GPU-enabled systems require a heterogeneous programming model that involves both
+CPU and GPU, where the CPU and its memory are referred to as the host,
+and the GPU and its memory as the device.
The TOP500 project ranks and details the 500 most powerful non-distributed computer systems in the world. The project was started in 1993 and publishes an updated list of the supercomputers twice a year. The snapshot below shows the top-5 HPC systems as of June 2023, where the columns show:
+
+
Cores - Number of processors
+
Rmax - Maximal LINPACK performance achieved
+
Rpeak - Theoretical peak performance
+
Power - Power consumption
+
+
+
All systems in the top-5 positions contain GPUs from AMD or NVIDIA, except for Fugaku which instead relies on custom-built Arm A64FX CPUs.
Compared to CPUs, GPUs can perform more calculations per watt of power consumed,
+which can result in significant energy savings. This is indeed evident from the Green500 list.
Not all workloads can be efficiently parallelized and accelerated on GPUs.
+Certain types of workloads, such as those with irregular data access patterns or
+high branching behavior, may not see significant performance improvements on GPUs.
Depending on the GPU programming API that you choose, GPU computing could
+require specialized skills in GPU programming and knowledge of
+GPU architecture, leading to a steeper learning curve compared to CPU programming.
+Fortunately, if you study this training material closely you will become productive
+with GPU programming quickly!
+
+
Keypoints
+
+
GPUs are accelerators for some types of tasks
+
Highly parallelizable compute-intensive tasks are suitable for GPUs
+
New programming skills are needed to use GPUs efficiently
+
+
+
+
\ No newline at end of file
diff --git a/2-gpu-ecosystem/index.html b/2-gpu-ecosystem/index.html
new file mode 100644
index 0000000..45ed4b8
--- /dev/null
+++ b/2-gpu-ecosystem/index.html
@@ -0,0 +1,546 @@
+ The GPU hardware and software ecosystem — Introduction to GPU Programming documentation
Accelerators offer high performance due to their scalability and high density of compute elements.
+
They have separate circuit boards connected to CPUs via PCIe bus, with their own memory.
+
The CPU copies data from its own memory to the GPU memory, launches the execution, and copies the results back.
+
GPUs run thousands of threads simultaneously, quickly switching between them to hide memory operations.
+
Effective data management and access patterns are critical on the GPU to avoid running out of memory.
+
+
+
One of the most important features that allows accelerators to reach this high performance is their scalability.
+Computational cores on accelerators are usually grouped into multiprocessors.
+The multiprocessors share the data and logical elements.
+This makes it possible to achieve a very high density of compute elements on a GPU.
+This also allows for better scaling: more multiprocessors mean more raw performance, which is easy to achieve with more transistors available.
+
+Accelerators are separate circuit boards with their own processor, memory, power management, etc.
+They are connected to the motherboard with the CPUs via the PCIe bus.
+Having their own memory means that data has to be copied to and from them.
+The CPU acts as the main processor, controlling the execution workflow.
+It copies data from its own memory to the GPU memory, launches the execution, and copies the results back.
+The GPU runs tens of thousands of threads simultaneously on thousands of cores and does not do much data management.
+With many cores trying to access memory simultaneously and with little cache available, the accelerator can run out of memory very quickly.
+This makes data management and its access pattern essential on the GPU.
+Accelerators like to be overloaded with threads, because they can switch between threads very quickly.
+This allows memory operations to be hidden: while some threads wait, others can compute.
CPUs and GPUs were designed with different goals in mind. While the CPU
+is designed to excel at executing a sequence of operations, called a thread,
+as fast as possible and can execute a few tens of these threads in parallel,
+the GPU is designed to excel at executing many thousands of them in parallel.
+GPUs were initially developed for the highly parallel task of graphics processing
+and therefore designed such that more transistors are devoted to data processing
+rather than data caching and flow control. More transistors dedicated to
+data processing is beneficial for highly parallel computations; the GPU can
+hide memory access latencies with computation, instead of relying on large data caches
+and complex flow control to avoid long memory access latencies,
+both of which are expensive in terms of transistors.
GPUs come together with software stacks or APIs that work in conjunction with the hardware and give software a standard way to interact with the GPU hardware. Software developers use them to write code that can take advantage of the parallel processing power of the GPU. Typically, they provide access to low-level functionality, such as memory management, data transfer between the CPU and the GPU, and the scheduling and execution of parallel processing tasks on the GPU. They may also provide higher-level functions and libraries optimized for specific HPC workloads, like linear algebra or fast Fourier transforms. Finally, to help developers write correct and optimized code, debugging and profiling tools are also included.
+
+NVIDIA, AMD, and Intel are the major companies which design and produce GPUs for HPC, each providing its own suite: CUDA, ROCm, and oneAPI, respectively. This way they can offer optimization, differentiation (offering unique features tailored to their devices), vendor lock-in, licensing, and royalty fees, which can result in better performance, profitability, and customer loyalty.
+There are also cross-platform APIs such as DirectCompute (only for the Windows operating system), OpenCL, and SYCL.
+
+
CUDA - In short
+
+
+
CUDA: NVIDIA’s parallel computing platform
+
Components: CUDA Toolkit & CUDA driver
+
Supports C, C++, and Fortran languages
+
+
+
+
+
+
CUDA API Libraries: cuBLAS, cuFFT, cuRAND, cuSPARSE
Compute Unified Device Architecture is the parallel computing platform from NVIDIA. The CUDA API provides a comprehensive set of functions and tools for developing high-performance applications that run on NVIDIA GPUs. It consists of two main components: the CUDA Toolkit and the CUDA driver. The toolkit provides a set of libraries, compilers, and development tools for programming and optimizing CUDA applications, while the driver is responsible for communication between the host CPU and the device GPU. CUDA is designed to work with programming languages such as C, C++, and Fortran.
+
+The CUDA API provides many highly optimized libraries such as: cuBLAS (for linear algebra operations, such as dense matrix multiplication), cuFFT (for performing fast Fourier transforms), cuRAND (for generating pseudo-random numbers), and cuSPARSE (for sparse matrix operations). Using these libraries, developers can quickly and easily accelerate complex computations on NVIDIA GPUs without having to write low-level GPU code themselves.
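As an illustration, below is a minimal sketch of calling one of these libraries from host code: a single-precision AXPY (y = alpha*x + y) through cuBLAS. Error checking is omitted for brevity, and the file name and array size are arbitrary choices.

// saxpy_cublas.cu (sketch). Compile with: nvcc saxpy_cublas.cu -o saxpy_cublas -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;                     // number of elements (arbitrary)
    const float alpha = 2.0f;
    std::vector<float> x(n, 1.0f), y(n, 3.0f); // host data

    float *d_x, *d_y;                          // device buffers
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // y = alpha*x + y on the GPU

    cudaMemcpy(y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", y[0]);                     // expect 5.0

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}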
+
There are several compilers that can be used for developing and executing code on NVIDIA GPUs, the main one being nvcc. The latest versions are based on the widely used LLVM (Low Level Virtual Machine) open-source compiler infrastructure. nvcc produces optimized code for NVIDIA GPUs and drives a supported host compiler for AMD, Intel, OpenPOWER, and Arm CPUs.
+
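For example, a trivial kernel can be compiled with nvcc and run as sketched below; the file name and the architecture flag are illustrative choices.

// hello.cu (sketch). Compile with, e.g.: nvcc -arch=sm_70 hello.cu -o hello
#include <cstdio>

__global__ void hello_kernel() {
    // Each GPU thread prints its block and thread index.
    printf("Hello from block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main() {
    hello_kernel<<<2, 4>>>();    // launch 2 blocks of 4 threads each
    cudaDeviceSynchronize();     // wait for the kernel (and its output) to finish
    return 0;
}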
In addition to this, nvc (a C11 compiler), nvc++ (a C++17 compiler), and nvfortran (an ISO Fortran 2003 compiler) are provided. These compilers can also create code for execution on NVIDIA GPUs, and they support GPU and multicore CPU programming with parallel language features, OpenACC, and OpenMP.
+
Programming mistakes are inevitable, and they have to be fixed as soon as possible. The CUDA toolkit includes the command line tool cuda-gdb which can be used to find errors in the code. It is an extension to GDB, the GNU Project debugger. The existing GDB debugging features are inherently present for debugging the host code, and additional features have been provided to support debugging CUDA device code, allowing simultaneous debugging of both GPU and CPU code within the same application. The tool provides developers with a mechanism for debugging CUDA applications running on actual hardware. This enables developers to debug applications without the potential variations introduced by simulation and emulation environments.
+
In addition to this, the command line tool compute-sanitizer can be used to look exclusively for memory access problems: unallocated buffers, out-of-bounds accesses, race conditions, and uninitialized variables.
+
Finally, in order to utilize the GPUs to the maximum, some performance analysis tools are needed. NVIDIA provides the NVIDIA Nsight Systems and NVIDIA Nsight Compute tools for helping developers optimize their applications. The former, NVIDIA Nsight Systems, is a system-wide performance analysis tool that provides detailed metrics on both CPU and GPU usage, memory bandwidth, and other system-level metrics. The latter, NVIDIA Nsight Compute, is a kernel-level performance analysis tool that allows developers to analyze the performance of individual CUDA kernels. It provides detailed metrics on kernel execution, including memory usage, instruction throughput, and occupancy. These tools have graphical interfaces which can be used for all steps of the performance analysis; however, on supercomputers it is recommended to use the command line interface for collecting the information needed and then to visualize and analyse the results using the graphical interface on a personal computer.
+
Apart from what was presented above, there are many other tools and features provided by NVIDIA. The CUDA ecosystem is very well developed.
ROCm is an open software platform allowing researchers to tap the power of AMD accelerators.
+The ROCm platform is built on the foundation of open portability, supporting environments across multiple
+accelerator vendors and architectures. In some ways it is very similar to the CUDA API.
+It contains libraries, compilers, and development tools for programming and optimizing programs for AMD GPUs.
+For debugging, it provides the command line tool rocgdb, while for performance analysis rocprof and roctracer.
+In order to produce code for the AMD GPUs, one can use the Heterogeneous-Computing Interface for Portability (HIP).
+HIP is a C++ runtime API and a set of tools that allows developers to write portable GPU-accelerated code for both NVIDIA and AMD platforms.
+It provides the hipcc compiler driver, which will call the appropriate toolchain depending on the desired platform.
+On the AMD ROCm platform, HIP provides a header and runtime library built on top of the HIP-Clang (ROCm compiler).
+On an NVIDIA platform, HIP provides a header file which translates from the HIP runtime APIs to CUDA runtime APIs.
+The header file contains mostly inlined functions and thus has very low overhead.
+The code is then compiled with nvcc, the standard C++ compiler provided with CUDA.
+On AMD platforms, library names are prefixed by roc and the libraries can be called directly from HIP. In order to make portable calls,
+one can call the libraries using hip-prefixed wrappers. These wrappers can be used at no performance cost and ensure that
+HIP code can be used on other platforms with no changes. The libraries included in ROCm are almost one-to-one equivalents of the ones supplied with CUDA.
+
ROCm also integrates with popular machine learning frameworks such as TensorFlow and PyTorch and provides optimized libraries and drivers to accelerate machine learning workloads on AMD GPUs enabling the researchers to leverage the power of ROCm and AMD accelerators to train and deploy machine learning models efficiently.
Intel oneAPI is a unified software toolkit developed by Intel that allows developers to optimize and deploy applications across a variety of architectures, including CPUs, GPUs, and FPGAs. It provides a comprehensive set of tools, libraries, and frameworks, enabling developers to leverage the full potential of heterogeneous computing environments. With oneAPI, the developers can write code once and deploy it across different hardware targets without the need for significant modifications or rewriting. This approach promotes code reusability, productivity, and performance portability, as it abstracts the complexities of heterogeneous computing and provides a consistent programming interface based on open standards.
+
+The core of the suite is the Intel oneAPI Base Toolkit, a set of tools and libraries for developing high-performance, data-centric applications across diverse architectures. It features an industry-leading C++ compiler that implements SYCL, an evolution of C++ for heterogeneous computing. It includes the Collective Communications Library, the Data Analytics Library, the Deep Neural Networks Library, the DPC++/C++ Compiler, the DPC++ Library, the Math Kernel Library, the Threading Building Blocks, the debugging tool Intel Distribution for GDB, the performance analysis tools Intel Advisor and Intel VTune Profiler, the Video Processing Library, the Intel Distribution for Python, the DPC++ Compatibility Tool, the FPGA Add-on for oneAPI Base Toolkit, and the Integrated Performance Primitives.
+This can be complemented with additional toolkits. The Intel oneAPI HPC Toolkit contains DPC++/C++ Compiler, Fortran and C++ Compiler Classic, debugging tools Cluster Checker and Inspector, Intel MPI Library, and performance analysis tool Intel Trace Analyzer and Collector.
+
oneAPI supports multiple programming models and programming languages. It enables developers to write OpenMP codes targeting multi-core CPUs and Intel GPUs using the Classic Fortran and C++ compilers, as well as SYCL programs for GPUs and FPGAs using the DPC++ compiler. Initially, the DPC++ compiler only targeted Intel GPUs using the oneAPI Level Zero low-level programming interface, but support for NVIDIA GPUs (using CUDA) and AMD GPUs (using ROCm) has now been added.
+Overall, Intel oneAPI offers a comprehensive and unified approach to heterogeneous computing, empowering developers to optimize and deploy applications across different architectures with ease. By abstracting the complexities and providing a consistent programming interface, oneAPI promotes code reusability, productivity, and performance portability, making it an invaluable toolkit for developers in the era of diverse computing platforms.
GPUs in general support different features, even among cards from the same producer. In general, newer cards come with extra
+features and sometimes old features are not supported anymore. It is important to create binaries
+targeting the specific architecture when compiling. A binary built for a newer card will not run on older devices,
+while a binary built for older devices might not run efficiently on newer architectures. In CUDA the compute
+capability which is targeted is specified by the -arch=sm_XY option, where X specifies the major architecture (between 1 and 9) and Y the minor one. When using HIP on NVIDIA platforms one needs to use the compile option --gpu-architecture=sm_XY, while on AMD platforms --offload-arch=gfxabc (where abc is the architecture code, such as 90a for the MI200 series or 908 for the MI100 series).
+Note that in the case of portable (single-source) programs one would also specify openmp as a target for
+compilation, enabling the same code to run on a multicore CPU.
Please keep in mind that this table is only a rough approximation.
+Each GPU architecture is different, and it’s impossible to make a 1-to-1 mapping between terms used by different vendors.
GPUs are designed to execute thousands of threads simultaneously, making them highly parallel processors. In contrast, CPUs excel at executing a smaller number of threads in parallel.
+
GPUs allocate a larger portion of transistors to data processing rather than data caching and flow control. This prioritization of data processing enables GPUs to effectively handle parallel computations and hide memory access latencies through computation.
+
GPU producers provide comprehensive toolkits, libraries, and compilers for developing high-performance applications that leverage the parallel processing power of GPUs. Examples include CUDA (NVIDIA), ROCm (AMD), and oneAPI (Intel).
+
These platforms offer debugging tools (e.g., cuda-gdb, rocgdb) and performance analysis tools (e.g., NVIDIA Nsight Systems, NVIDIA Nsight Compute, rocprof, roctracer) to facilitate code optimization and ensure efficient utilization of GPU resources.
Which statement about the relationship between GPUs and memory is true?
+
+
+
GPUs are not affected by memory access latencies.
+
+
+
+
GPUs can run out of memory quickly with many cores trying to access the memory simultaneously.
+
+
+
+
GPUs have an unlimited cache size.
+
+
+
+
GPUs prefer to run with a minimal number of threads to manage memory effectively.
+
+
+
+
+
Solution
+
The correct answer is B). This is true because GPUs run many threads simultaneously on thousands of
+cores, and with limited cache available, this can lead to the GPU running out of memory quickly if many
+cores are trying to access the memory simultaneously. This is why data management and access patterns
+are essential in GPU computing.
+
+
+
+
Keypoints
+
+
GPUs vs. CPUs, key differences between them
+
GPU software suites support specific GPU features, programming models, and compatibility
+
Applications of GPUs
+
+
+
+
\ No newline at end of file
diff --git a/3-gpu-problems/index.html b/3-gpu-problems/index.html
new file mode 100644
index 0000000..96eead9
--- /dev/null
+++ b/3-gpu-problems/index.html
@@ -0,0 +1,398 @@
+ What problems fit to GPU? — Introduction to GPU Programming documentation
From a metaphorical point of view, the GPU can be seen as a person lying on a bed
+of nails. The person lying on top is the data, and at the base of each nail there
+is a processor, so the nail is actually an arrow pointing from processor to memory.
+All nails are in a regular pattern, like a grid. If the body is well spread out,
+it feels good (performance is good); if the body only touches some spots of the
+nail bed, then it hurts (performance is bad).
+
+
GPU computing is well-suited to problems that involve large amounts of data parallelism.
+Specifically, you can expect good performance on GPUs for:
+
+
Large-scale matrix and vector operations: Common in machine learning, scientific computing, and image processing.
+
Fourier transforms: Also common in machine learning, scientific computing, and image processing.
+
Monte Carlo simulations: Used across finance, physics, and other fields to simulate complex systems.
+
Molecular dynamics simulations: Used in chemistry, biochemistry and physics.
+
Computational fluid dynamics: Used in engineering, physics, and other fields.
+
Convolutional neural networks and computer vision algorithms.
+
Big data analytics: Clustering, classification, regression, etc.
Not all programming problems can efficiently leverage the parallelism offered by GPUs.
+Some types of problems that do not fit well on a GPU include:
+
+
Sequential tasks: Problems that require a series of dependent steps,
+where each step relies on the outcome of the previous step, are not well-suited
+for parallel processing. Examples include recursive algorithms, certain dynamic
+programming problems, and some graph traversal algorithms.
+
Fine-grained branching: GPUs perform best when the code being executed across
+different threads follows a similar control flow. When there is extensive
+branching (i.e., many if statements) within a kernel or algorithm, performance
+may suffer due to the divergence in execution paths among the GPU threads.
+
Low arithmetic intensity: GPUs excel at performing a large number of mathematical
+operations quickly. If a problem has low arithmetic intensity (i.e., a low ratio of
+arithmetic operations to memory accesses), the GPU may not be able to efficiently utilize
+its computational power, leading to underperformance.
+
Small data sets: If the problem involves a small data set that does not require significant
+parallelism, using a GPU may not result in noticeable performance gains. In such cases,
+the overhead of transferring data between the CPU and GPU, and the time spent initializing the GPU,
+may outweigh any potential benefits.
+
Limited parallelism: Some algorithms have inherent limitations on the degree of parallelism that can be
+achieved. In these cases, using a GPU may not lead to significant performance improvements.
+
Memory-bound problems: GPUs generally have less memory available compared to CPUs, and their memory bandwidth
+can be a limiting factor. If a problem requires a large amount of memory or involves memory-intensive operations,
+it may not be well-suited for a GPU.
To give a flavor of what type of performance gains we can achieve by porting a calculation to a GPU
+(if we’re lucky!), let’s look at a few case examples.
+
+
Effect of array size
+
Consider the case of matrix multiplication in the Julia language:
VASP is a popular software package used for electronic structure calculations. The figures below show the speedup observed in a recent benchmark study on the Perlmutter and Cori supercomputers, along with an analysis of total energy usage.
A great deal of computational resources is spent in quantum chemical calculations, which involve
+the solution of the Hartree-Fock eigenvalue problem. This requires the diagonalization of the
+Fock matrix, whose elements are given by:
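In the standard closed-shell (restricted) Hartree-Fock formulation, these elements can be written as

\[
F_{\alpha\beta} = H^{\mathrm{core}}_{\alpha\beta} + \sum_{\gamma\delta} D_{\gamma\delta} \left[ (\alpha\beta|\gamma\delta) - \tfrac{1}{2} (\alpha\delta|\gamma\beta) \right]
\]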
The first term is related to the one-electron contributions and the second term is related to the
+electron repulsion integrals (ERIs), in parentheses, weighted by the density matrix
+\(D_{\gamma \delta}\). One of the most expensive parts in the solution of the Hartree-Fock equations is the
+processing (digestion) of the ERIs; one algorithm to do this task is as follows:
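A schematic version of such a digestion step, written here as a plain C++ loop over all quartets of basis functions (a sketch of the standard quartet-driven scheme, with eri() standing in for whatever integral routine is used), looks like this:

// For every quartet of basis functions, compute the ERI once and scatter it
// into the Fock matrix with the corresponding density-matrix weights.
for (int a = 0; a < nbf; ++a)
  for (int b = 0; b < nbf; ++b)
    for (int c = 0; c < nbf; ++c)
      for (int d = 0; d < nbf; ++d) {
        double I = eri(a, b, c, d);      // (ab|cd), hypothetical integral routine
        F[a][b] += D[c][d] * I;          // Coulomb contribution
        F[a][c] -= 0.5 * D[b][d] * I;    // exchange contribution
      }

The quartic number of quartets and the arithmetic-heavy evaluation of each integral are what make this step a good match for the massive parallelism of GPUs.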
+
+
This algorithm is suitable for GPUs as it involves many arithmetic operations. In addition to this,
+there are symmetries and properties of the integrals that could be used to rearrange the loops in
+an efficient manner that fits GPU architectures.
A brief introduction into some of the work that is being done in the humanities that can benefit from utilizing GPUs.
+
Language models and NLP (natural language processing)
+
With the recent popularity of ChatGPT, the use of language models has come into the mainstream;
+however, such models have been used in the humanities for many years already. One of the biggest goals of humanities
+researchers is working with textual data, which has increased exponentially over recent years due to the rise of
+social media. Analyzing such textual data to gain insights into questions of sociology, linguistics and various
+other fields has become increasingly reliant on using language models. Along with language models,
+the need for GPU access has become essential.
+
Archeology
+
The field of archeology also makes use of GPUs in its 3D modelling
+and rendering work. The biggest problem with archeological sites is that once they are excavated,
+they are destroyed, so any researchers who aren’t present at the site would lose valuable insights into how
+it looked when it was found. However, with recent developments in technology and access to high-performance
+computing, researchers are able to generate extremely detailed renderings of the excavation sites, which act as a way to
+preserve the site for future researchers to gain critical insights and contribute to the research.
+
Cognitive Science
+
Techniques such as Markov Chain Monte Carlo (MCMC) sampling have proven to be invaluable in studies that delve into human behavior or population dynamics. MCMC sampling allows researchers to simulate and analyze complex systems by iteratively sampling from a Markov chain, enabling the exploration of high-dimensional parameter spaces. This method is particularly useful when studying human behavior, as it can capture the inherent randomness and interdependencies that characterize social systems. By leveraging MCMC sampling, researchers can gain insights into various aspects of human behavior, such as decision-making, social interactions, and the spread of information or diseases within populations.
+
By offloading the computational workload to GPUs, researchers can experience substantial speedup in the execution of MCMC algorithms. This speedup allows for more extensive exploration of parameter spaces and facilitates the analysis of larger datasets, leading to more accurate and detailed insights into human behavior or population dynamics. Examples of studies done using these methods can be found at the Center for Humanities Computing Aarhus (CHCAA) and Interacting Minds Centre (IMC) at Aarhus University.
Which of the following computational tasks is likely to gain the least performance benefit from being ported to a GPU?
+
+
Training a large, deep neural network.
+
Performing a Monte Carlo simulation with a large number of independent trials.
+
Executing an algorithm with heavy use of recursion and frequent branching.
+
Processing a large image with a convolutional filter.
+
+
+
Solution
+
The right answer is option 3. GPUs do not handle recursion and branching as effectively as more
+data-heavy algorithms.
+
+
+
+
Keypoints
+
+
GPUs excel in processing tasks with high data parallelism, such as large-scale matrix operations, Fourier transforms, and big data analytics.
+
GPUs struggle with sequential tasks, problems with extensive control flow divergence, low arithmetic intensity tasks, small data sets, and memory-bound problems.
Most computing problems are not trivially parallelizable, which means that the subtasks
+need to have access, from time to time, to some of the results computed by other subtasks.
+The way subtasks exchange the needed information depends on the available hardware.
+
+
In a distributed memory environment each computing unit operates independently from the
+others. It has its own memory and it cannot access the memory in other nodes.
+The communication is done via the network, and each computing unit runs a separate copy of the
+operating system. In a shared memory machine all computing units have access to the memory
+and can read or modify the variables within.
The type of environment (distributed- or shared-memory) determines the programming model.
+There are two types of parallelism possible, process based and thread based.
+
+
For distributed memory machines, a process-based parallel programming model is employed.
+The processes are independent execution units which have their own memory address spaces.
+They are created when the parallel program is started and they are only terminated at the
+end. The communication between them is done explicitly via message passing like MPI.
+
On the shared memory architectures it is possible to use a thread based parallelism.
+The threads are light execution units and can be created and destroyed at a relatively
+small cost. The threads have their own state information but they share the same memory
+address space. When needed the communication is done though the shared memory.
+
Both approaches have their advantages and disadvantages. Distributed machines are
+relatively cheap to build and they have an “infinite” capacity. In principle one could
+add more and more computing units. In practice, the more computing units are used, the more
+time-consuming the communication becomes. The shared memory systems can achieve good performance
+and the programming model is quite simple. However, they are limited by the memory capacity
+and by the access speed. In addition, in the shared memory model it is much easier to
+create race conditions.
There are two types of parallelism that can be explored.
+The data parallelism is when the data can be distributed across computational units that can run in parallel.
+The units process the data by applying the same or very similar operation to different data elements.
+A common example is applying a blur filter to an image — the same function is applied to all the pixels on an image.
+This parallelism is natural for the GPU, where the same instruction set is executed in multiple threads.
+
+
Data parallelism can usually be explored by the GPUs quite easily.
+The most basic approach would be finding a loop over many data elements and converting it into a GPU kernel, as sketched below.
+If the number of elements in the data set is fairly large (tens or hundreds of thousands of elements), the GPU should perform quite well. Although it would be odd to expect absolute maximum performance from such a naive approach, it is often the one to take. Getting the absolute maximum out of data parallelism requires a good understanding of how the GPU works.
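For instance, a simple loop that scales every element of an array (a hypothetical example) translates almost mechanically into a kernel:

// CPU version: one loop over all elements.
// for (int i = 0; i < n; ++i) y[i] = 2.0f * y[i];

// GPU version: the loop body becomes the kernel, and the loop index becomes the thread index.
__global__ void scale_kernel(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // some threads of the last block may fall beyond the array
        y[i] = 2.0f * y[i];
}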
+
Another type of parallelism is a task parallelism.
+This is when an application consists of more than one task, each requiring different operations with (the same or) different data.
+An example of task parallelism is cooking: slicing vegetables and grilling are very different tasks and can be done at the same time.
+Note that the tasks can consume totally different resources, which can also be exploited.
+
+
In short
+
+
Computing problems can be parallelized in distributed memory or shared memory architectures.
+
In distributed memory, each unit operates independently, with no direct memory access between nodes.
+
In shared memory, units have access to the same memory and can communicate through shared variables.
+
Parallel programming can be process-based (distributed memory) or thread-based (shared memory).
+
Process-based parallelism uses independent processes with separate memory spaces and explicit message passing.
+
Thread-based parallelism uses lightweight threads that share the same memory space and communicate through shared memory.
+
Data parallelism distributes data across computational units, processing them with the same or similar operations.
+
Task parallelism involves multiple independent tasks that perform different operations on the same or different data.
+
Task parallelism involves executing different tasks concurrently, leveraging different resources.
In order to obtain maximum performance it is important to understand how GPUs execute the programs. As mentioned before a CPU is a flexible device oriented towards general purpose usage. It’s fast and versatile, designed to run operating systems and various, very different types of applications. It has lots of features, such as better control logic, caches and cache coherence, that are not related to pure computing. CPUs optimize the execution by trying to achieve low latency via heavy caching and branch prediction.
+
+
+In contrast, GPUs contain a relatively small number of transistors dedicated to control and caching, and a much larger fraction of transistors dedicated to the mathematical operations. Since the cores in a GPU are designed just for 3D graphics, they can be made much simpler and there can be a very large number of cores. Current GPUs contain thousands of CUDA cores. Performance in GPUs is obtained by having a very high degree of parallelism: lots of threads are launched in parallel, and for good performance there should be at least several times more threads than CUDA cores. GPU threads are much lighter than the usual CPU threads and they have very little penalty for context switching. This way, while some threads are performing memory operations (reading or writing), others execute instructions.
In order to perform some work, the program launches a function called a kernel, which is executed simultaneously by tens of thousands of threads that run in parallel on GPU cores. GPU threads are much lighter than the usual CPU threads and they have very little penalty for context switching. By “over-subscribing” the GPU there are always threads performing memory operations (reading or writing), while others execute instructions.
+
+
+Every thread is associated with a particular intrinsic index which can be used to calculate and access memory locations in an array. Each thread has its own context and set of private variables. All threads have access to the global GPU memory, but there is no general way to synchronize while executing a kernel. If some threads need data from the global memory which was modified by other threads, the code would have to be split into several kernels, because only at the completion of a kernel is it ensured that writing to the global memory has completed.
+
+Apart from being much more lightweight, there are more differences between GPU threads and CPU threads. GPU threads are grouped together in groups called warps. This is done at the hardware level.
+
+
+All accesses to the GPU memory are done as a group in blocks of specific sizes (32B, 64B, 128B, etc.). To obtain good performance, the CUDA threads in the same warp need to access elements of the data which are adjacent in memory. This is called coalesced memory access, and is illustrated in the sketch below.
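A sketch of the difference between the two access patterns; the stride value is an arbitrary illustration:

// Coalesced: consecutive threads of a warp read consecutive elements of `in`,
// so each warp touches only a few contiguous memory blocks.
__global__ void copy_coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read elements far apart, so each warp touches
// many separate memory blocks and bandwidth is wasted.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * stride) % n];
}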
+
On some architectures, all members of a warp have to execute the
+same instruction, the so-called “lock-step” execution. This is done to achieve
+higher performance, but there are some drawbacks. If an if statement
+is present inside a warp, it will cause the warp to be executed more than once,
+one time for each branch. When different threads within a single warp
+take different execution paths based on a conditional statement (if), both
+branches are executed sequentially, with some threads being active while
+others are inactive. On architectures without lock-step execution, such
+as NVIDIA Volta / Turing (e.g., GeForce 16xx-series) or newer, warp
+divergence is less costly.
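A sketch of a kernel that forces divergence within every warp; the branch condition is chosen purely for illustration:

__global__ void divergent_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Even and odd lanes of each warp take different branches, so on lock-step
    // architectures the two branches are executed one after the other.
    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}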
+
+There is another level in the GPU thread hierarchy. The threads are grouped together in so-called blocks. Each block is assigned to one Streaming Multiprocessor (SMP) unit. A SMP contains one or more SIMT (single instruction multiple threads) units, schedulers, and very fast on-chip memory. Some of this on-chip memory can be used in the programs; this is called shared memory. The shared memory can be used to “cache” data that is used by more than one thread, thus avoiding multiple reads from the global memory. It can also be used to avoid memory accesses which are not efficient. For example, in a matrix transpose operation we have two memory operations per element and only one of them can be coalesced. In the first step a tile of the matrix is read in a coalesced manner and saved in the shared memory. After all the reads of the block are done, the tile can be locally transposed (which is very fast) and then written to the destination matrix in a coalesced manner as well. Shared memory can also be used to perform block-level reductions and similar collective operations. All threads can be synchronized at block level. Furthermore, when the shared memory is written, synchronization is compulsory to ensure that all threads have completed the operation and that the program is correct.
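A sketch of the tiled transpose described above, assuming square matrices whose size is a multiple of the 32x32 tile and a launch with 32x32 thread blocks:

#define TILE 32

// Launch with dim3 block(TILE, TILE) and dim3 grid(width/TILE, width/TILE).
__global__ void transpose_tiled(const float *in, float *out, int width) {
    // The +1 padding is a common trick to avoid shared-memory bank conflicts.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();   // make sure the whole tile is in shared memory

    // Swap the block coordinates so that the write is coalesced as well.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
}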
+
+
+Finally, a block of threads cannot be split among SMPs. For performance, blocks should have more than one warp. The more warps are active on an SMP, the better the latency associated with memory operations is hidden. If the resources are sufficient, due to fast context switching, an SMP can have more than one block active at the same time. However, these blocks cannot share data with each other via the on-chip memory.
+
+To summarize this section: in order to take advantage of GPUs, the algorithms must allow the division of work into many small subtasks which can be executed at the same time. The computations are offloaded to GPUs by launching tens of thousands of threads all executing the same function, the kernel, with each thread working on a different part of the problem. The threads are executed in groups called blocks, each block being assigned to an SMP. Furthermore, the threads of a block are divided into warps, each executed by a SIMT unit. All threads in a warp execute the same instructions and all memory accesses are done collectively at warp level. The threads can synchronize and share data only at block level. Depending on the architecture, some data sharing can be done as well at warp level.
+
+In order to hide latencies it is recommended to “over-subscribe” the GPU. There should be many more blocks than SMPs present on the device. Also, in order to ensure a good occupancy of the CUDA cores, there should be more warps active on a given SMP than SIMT units. This way, while some warps of threads are idle waiting for some memory operations to complete, others use the CUDA cores, thus ensuring a high occupancy of the GPU.
+
In addition to this there are some architecture-specific features of which the developers can take advantage. Warp-level operations are primitives provided by the GPU architecture to allow for efficient communication and synchronization within a warp. They allow threads within a warp to exchange data efficiently, without the need for explicit synchronization. These warp-level operations, combined with the organization of threads into blocks and clusters, make it possible to implement complex algorithms and achieve high performance on the GPU. The cooperative groups feature introduced in recent versions of CUDA provides even finer-grained control over thread execution, allowing for even more efficient processing by giving more flexibility to the thread hierarchy. Cooperative groups allow threads within a block to organize themselves into smaller groups, called cooperative groups, and to synchronize their execution and share data within the group.
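As an example of such a warp-level primitive, a warp-wide sum can be written with CUDA’s __shfl_down_sync; the sketch below assumes a full warp of 32 active threads:

__device__ float warp_reduce_sum(float val) {
    // Each step halves the number of participating lanes; after five steps
    // lane 0 holds the sum of the values of all 32 threads in the warp.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}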
+
Below there is an example of how the threads in a grid can be associated with specific elements of an array
+
+
The thread marked by orange color is part of a grid of 4096 threads. The threads are grouped in blocks of size 256. The “orange” thread has index 3 in block 2, and its calculated global index is 515.
+
+For a vector addition example this would be used as follows: c[index] = a[index] + b[index].
+
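A sketch of the corresponding kernel and launch, using the block size of 256 and the total of 4096 threads mentioned above:

__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    // Block 2, thread 3 with 256 threads per block gives 2*256 + 3 = 515.
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n)                          // guard threads beyond the end of the array
        c[index] = a[index] + b[index];
}

// Host side: 4096 threads in total, grouped into 16 blocks of 256 threads.
// vector_add<<<4096 / 256, 256>>>(d_a, d_b, d_c, 4096);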
+
In short
+
+
GPUs have a different execution model compared to CPUs, with a focus on parallelism and mathematical operations.
+
GPUs consist of thousands of lightweight threads that can be executed simultaneously on GPU cores.
+
Threads are organized into warps, and warps are grouped into blocks assigned to streaming multiprocessors (SMPs).
+
GPUs achieve performance through high degrees of parallelism and efficient memory access.
+
Shared memory can be used to cache data and improve memory access efficiency within a block.
+
Synchronization and data sharing are limited to the block level, with some possible sharing at the warp level depending on the architecture.
+
Over-subscribing the GPU and maximizing warp and block occupancy help hide latencies and improve performance.
+
Warp-level operations and cooperative groups provide efficient communication and synchronization within a warp or block.
+
Thread indexing allows associating threads with specific elements in an array for parallel processing.
At the moment there are three major GPU producers: NVIDIA, Intel, and AMD. While the basic concept behind GPUs is pretty similar, they use different names for the various parts. Furthermore, there are software environments for GPU programming, some from the producers and some from external groups, all having different naming as well. Below there is a short compilation of some of the terms used across different platforms and software environments.
What are threads in the context of shared memory architectures?
+
+
Independent execution units with their own memory address spaces
+
Light execution units with shared memory address spaces
+
Communication devices between separate memory units
+
Programming models for distributed memory machines
+
+
+
Solution
+
Correct answer: b) Light execution units with shared memory address spaces
+
+
+
+
What is data parallelism?
+
+
Distributing data across computational units that run in parallel, applying the same or similar operations to different data elements.
+
Distributing tasks across computational units that run in parallel, applying different operations to the same data elements.
+
Distributing data across computational units that run sequentially, applying the same operation to all data elements.
+
Distributing tasks across computational units that run sequentially, applying different operations to different data elements.
+
+
+
Solution
+
Correct answer: a) Distributing data across computational units that run in parallel, applying the same or similar operations to different data elements.
+
+
+
+
What type of parallelism is natural for GPU?
+
+
Task Parallelism
+
Data Parallelism
+
Both data and task parallelism
+
Neither data nor task parallelism
+
+
+
Solution
+
Correct answer: b) Data Parallelism
+
+
+
+
What is a kernel in the context of GPU execution?
+
+
A specific section of the CPU used for memory operations.
+
A specific section of the GPU used for memory operations.
+
A type of thread that operates on the GPU.
+
A function that is executed simultaneously by tens of thousands of threads on GPU cores.
+
+
+
Solution
+
Correct answer: d) A function that is executed simultaneously by tens of thousands of threads on GPU cores.
+
+
+
+
What is coalesced memory access?
+
+
It’s when CUDA threads in the same warp access elements of the data which are adjacent in the memory.
+
It’s when CUDA threads in different warps access elements of the data which are far in the memory.
+
It’s when all threads have access to the global GPU memory.
+
It’s when threads in a warp perform different operations.
+
+
+
Solution
+
Correct answer: a) It’s when CUDA threads in the same warp access elements of the data which are adjacent in the memory.
+
+
+
+
What is the function of shared memory in the context of GPU execution?
+
+
It’s used to store global memory.
+
It’s used to store all the threads in a block.
+
It can be used to “cache” data that is used by more than one thread, avoiding multiple reads from the global memory.
+
It’s used to store all the CUDA cores.
+
+
+
Solution
+
Correct answer: c) It can be used to “cache” data that is used by more than one thread, avoiding multiple reads from the global memory.
+
+
+
+
What is the significance of over-subscribing the GPU?
+
+
It reduces the overall performance of the GPU.
+
It ensures that there are more blocks than SMPs present on the device, helping to hide latencies and ensure high occupancy of the GPU.
+
It leads to a memory overflow in the GPU.
+
It ensures that there are more SMPs than blocks present on the device.
+
+
+
Solution
+
Correct answer: b) It ensures that there are more blocks than SMPs present on the device, helping to hide latencies and ensure high occupancy of the GPU.
+
+
+
+
Keypoints
+
+
Parallel computing can be classified into distributed-memory and shared-memory architectures
+
Two types of parallelism that can be explored are data parallelism and task parallelism.
+
GPUs are a type of shared memory architecture suitable for data parallelism.
+
GPUs have high parallelism, with threads organized into warps and blocks.
+
GPU optimization involves coalesced memory access, shared memory usage, and high thread and warp occupancy. Additionally, architecture-specific features such as warp-level operations and cooperative groups can be leveraged for more efficient processing.