Exercises for the course GPU Programming @University of Luxembourg.
- The book "Programming massively parallel processors 4th edition" is a valuable resource.
- The CUDA online courses, together with the code given on Moodle, can also be helpful.
We give an overview of the CUDA model of computation.
- `1-introduction-cuda/demo`: contains the fully functional code demo shown and explained during the course.
- `1-introduction-cuda/exercises`: contains the exercises (`.cu` files).
- `1-introduction-cuda/solutions`: contains the solutions to the exercises.
We reimplement the saxpy routine of the BLAS (Basic Linear Algebra Subprograms) library that is widely used (and heavily optimized) on many systems.
`saxpy` computes the simple operation `result = scale * X + Y`, where `X`, `Y`, and `result` are vectors of `N` elements (where `N` can be 20 million in this project) and `scale` is a scalar.
Note that `saxpy` performs two math operations (one multiply, one add) for every three elements used. `saxpy` is a trivially parallelizable computation and features predictable, regular data access and predictable execution cost.
- Implement a basic sequential version of this program with time measurement (a consolidated sketch covering the exercises in this list is given after it).

```cpp
void saxpy(float scale, const std::vector<float>& X, const std::vector<float>& Y, std::vector<float>& result);
```
- Implement a CUDA kernel parallelizing this operation, paying attention to memory accesses and using managed memory for simplicity. Measure the time and compare the efficiency. At the end of the computation, verify that you get correct results by comparing with the results of the sequential implementation.
- Use global memory instead of managed memory. For that, you need to allocate the memory using `cudaMalloc`, followed by a `cudaMemcpy` to transfer the data from the CPU to the GPU, and finally call your kernel.
- Measure the time taken by your kernel both excluding and including the memory transfer.
- [Optional⭐] Set up a CMake project inspired by `2-cuda/floyd` (a minimal `CMakeLists.txt` sketch also follows this list).
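The following is a minimal sketch of the three variants above in one program, not the required solution: the scale value, the launch configuration, and the timing choices (`std::chrono` on the host, `cudaEvent` around the GPU work) are illustrative assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Sequential reference version.
void saxpy(float scale, const std::vector<float>& X,
           const std::vector<float>& Y, std::vector<float>& result) {
  for (size_t i = 0; i < X.size(); ++i) {
    result[i] = scale * X[i] + Y[i];
  }
}

// One thread per element; the grid-stride loop keeps the kernel
// correct for any N and launch configuration. Consecutive threads
// access consecutive elements, so global loads/stores are coalesced.
__global__ void saxpy_kernel(float scale, const float* X, const float* Y,
                             float* result, size_t n) {
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    result[i] = scale * X[i] + Y[i];
  }
}

int main() {
  const size_t N = 20000000;  // 20 million elements
  const float scale = 2.0f;
  std::vector<float> X(N, 1.0f), Y(N, 3.0f), expected(N);

  // Sequential version, timed on the host.
  auto t0 = std::chrono::steady_clock::now();
  saxpy(scale, X, Y, expected);
  auto t1 = std::chrono::steady_clock::now();
  printf("CPU: %lld ms\n", (long long)std::chrono::duration_cast<
      std::chrono::milliseconds>(t1 - t0).count());

  const int threads = 256;
  const int blocks = (int)((N + threads - 1) / threads);

  // Variant 1: managed memory, directly visible from CPU and GPU.
  float *xm, *ym, *rm;
  cudaMallocManaged(&xm, N * sizeof(float));
  cudaMallocManaged(&ym, N * sizeof(float));
  cudaMallocManaged(&rm, N * sizeof(float));
  std::copy(X.begin(), X.end(), xm);
  std::copy(Y.begin(), Y.end(), ym);
  saxpy_kernel<<<blocks, threads>>>(scale, xm, ym, rm, N);
  cudaDeviceSynchronize();
  for (size_t i = 0; i < N; ++i) {  // verify against the CPU results
    if (rm[i] != expected[i]) { printf("mismatch at %zu\n", i); return 1; }
  }

  // Variant 2: explicit global memory and transfers.
  float *xd, *yd, *rd;
  cudaMalloc(&xd, N * sizeof(float));
  cudaMalloc(&yd, N * sizeof(float));
  cudaMalloc(&rd, N * sizeof(float));
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);  // this measurement includes the transfers;
                           // record around the launch only to exclude them
  cudaMemcpy(xd, X.data(), N * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(yd, Y.data(), N * sizeof(float), cudaMemcpyHostToDevice);
  saxpy_kernel<<<blocks, threads>>>(scale, xd, yd, rd, N);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("GPU (transfers + kernel): %.2f ms\n", ms);

  cudaFree(xd); cudaFree(yd); cudaFree(rd);
  cudaFree(xm); cudaFree(ym); cudaFree(rm);
  return 0;
}
```

For the optional CMake exercise, a minimal `CMakeLists.txt` along these lines can serve as a starting point (the project name, file name, and architecture value are placeholders; mirror the setup of `2-cuda/floyd`):

```cmake
cmake_minimum_required(VERSION 3.18)
project(saxpy LANGUAGES CXX CUDA)

add_executable(saxpy saxpy.cu)
# Adjust to your GPU (e.g. 75 for Turing, 80 for A100).
set_target_properties(saxpy PROPERTIES CUDA_ARCHITECTURES "75")
```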
- Modify Floyd-Warshall to use `battery::vector<int, battery::managed_allocator>` instead of `int**` (check this tutorial); a sketch of the idea follows.
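A hedged sketch of the idea, assuming `battery::vector` mirrors the `std::vector` interface (size-and-value constructor, `operator[]`, `data()`) and that the header names below are correct; check the tutorial for the actual API. Flattening the `int**` matrix into a single `N*N` vector is an illustrative choice, not a requirement:

```cpp
#include <battery/vector.hpp>     // assumed header names; see the tutorial
#include <battery/allocator.hpp>
#include <cuda_runtime.h>

// Managed memory: the same vector is readable/writable from CPU and GPU.
using mvector = battery::vector<int, battery::managed_allocator>;

// One relaxation step of Floyd-Warshall over the flattened matrix.
__global__ void floyd_step(int* dist, int k, int N) {
  int i = blockIdx.x, j = threadIdx.x;
  int through_k = dist[i * N + k] + dist[k * N + j];
  if (through_k < dist[i * N + j]) { dist[i * N + j] = through_k; }
}

int main() {
  const int N = 512;              // illustrative size
  mvector dist(N * N, 1000000);   // assumed std::vector-like constructor
  for (int i = 0; i < N; ++i) dist[i * N + i] = 0;
  // ... fill in the edge weights on the host ...
  for (int k = 0; k < N; ++k) {
    floyd_step<<<N, N>>>(dist.data(), k, N);
    cudaDeviceSynchronize();
  }
  return 0;
}
```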
The project description is available here.
- Implement a GPU version of the histogram computation (a kernel sketch follows this list).
- Profile your code with `nsys` and use CUDA streams to improve the efficiency.
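A minimal kernel sketch for the histogram exercise, assuming byte-valued input and 256 bins; `atomicAdd` serializes concurrent increments of the same bin:

```cpp
#include <cuda_runtime.h>

#define NUM_BINS 256

// Grid-stride loop: works for any input size and launch configuration.
__global__ void histogram(const unsigned char* data, size_t n,
                          unsigned int* bins) {
  for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x) {
    atomicAdd(&bins[data[i]], 1u);
  }
}
```

For the profiling and streams item, a typical starting point is `nsys profile --stats=true ./histogram`, then splitting the input into chunks so that the transfer of one chunk overlaps the kernel of another. A sketch, assuming `h_data` is pinned host memory (allocated with `cudaMallocHost`, which asynchronous copies need in order to overlap) and `d_data`/`d_bins` are device buffers:

```cpp
const int NUM_STREAMS = 4;         // illustrative choice
cudaStream_t streams[NUM_STREAMS];
for (int s = 0; s < NUM_STREAMS; ++s) cudaStreamCreate(&streams[s]);

size_t chunk = n / NUM_STREAMS;    // assume n divisible, for brevity
for (int s = 0; s < NUM_STREAMS; ++s) {
  unsigned char* dst = d_data + s * chunk;
  // Copies and kernels of different chunks go to different streams,
  // so the transfer of chunk s+1 can overlap the kernel of chunk s.
  cudaMemcpyAsync(dst, h_data + s * chunk, chunk, cudaMemcpyHostToDevice,
                  streams[s]);
  histogram<<<128, 256, 0, streams[s]>>>(dst, chunk, d_bins);
}
cudaDeviceSynchronize();
```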
The project description is available here.
- Investigate, on the example `7-memory-transaction/exercises/grid_min.cu`, for how many blocks the coalesced and non-coalesced versions are almost equivalent in terms of performance. Explain why.
- In `7-memory-transaction/exercises/grid_min_bank.cu`: use shared memory on the example `7-memory-transaction/exercises/grid_min.cu` and find a way to avoid bank conflicts (a reference pattern follows this list).
- For 2D arrays, why should the width of the array be a multiple of 32?
- Read this article to understand how to optimize prefix sum to avoid bank conflicts.
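As a reference pattern (not the expected solution for `grid_min_bank.cu`), here is a block-level minimum in shared memory with sequential addressing: at every step, thread `t` touches `smem[t]` and `smem[t + stride]`, so the 32 threads of a warp read 32 consecutive words, one per bank, and no conflicts occur:

```cpp
#include <cfloat>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Each block reduces BLOCK_SIZE elements to one partial minimum;
// the host then reduces the per-block results (or a second kernel does).
__global__ void block_min(const float* in, float* out, size_t n) {
  __shared__ float smem[BLOCK_SIZE];
  size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
  smem[threadIdx.x] = (i < n) ? in[i] : FLT_MAX;
  __syncthreads();

  // Halving stride: consecutive threads read consecutive shared words,
  // which maps each warp onto 32 distinct banks (no bank conflicts).
  for (unsigned stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) {
      smem[threadIdx.x] = fminf(smem[threadIdx.x], smem[threadIdx.x + stride]);
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) out[blockIdx.x] = smem[0];
}
```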
- [⭐⭐] Program a GPU kernel converting a COO matrix into a CSR matrix (hint: use the histogram and prefix sum algorithms; a sketch follows this list).
- Define the Compressed Sparse Column Format (CSC), which is similar to CSR but using column indexing.
- Analyse and give the properties of the hybrid ELL-COO format.
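A sketch of the COO-to-CSR pipeline suggested by the hint, with the exclusive prefix sum left abstract (your Brent-Kung kernel or `thrust::exclusive_scan` both fit); the kernel and array names are illustrative:

```cpp
#include <cuda_runtime.h>

// Pass 1: histogram of row indices -> number of nonzeros per row.
__global__ void count_rows(const int* coo_row, int nnz, int* row_counts) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nnz) atomicAdd(&row_counts[coo_row[i]], 1);
}

// Between the passes: an exclusive prefix sum of row_counts produces
// row_offsets, so row_offsets[r] is where row r starts in the CSR arrays.
// Copy row_offsets into next_slot before pass 2 (pass 2 mutates it).

// Pass 2: scatter each COO entry into its row's segment; atomicAdd on
// next_slot hands out destination slots within a row. Note: entries of a
// row land in arbitrary order; sort by column afterwards if needed.
__global__ void scatter(const int* coo_row, const int* coo_col,
                        const float* coo_val, int nnz, int* next_slot,
                        int* csr_col, float* csr_val) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < nnz) {
    int dst = atomicAdd(&next_slot[coo_row[i]], 1);
    csr_col[dst] = coo_col[i];
    csr_val[dst] = coo_val[i];
  }
}
```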
- Implement either top-down or bottom-up BFS, following the algorithms seen in class (a top-down sketch follows this list).
- [⭐] It is possible to implement the BFS algorithms using only BLAS operations. Think about how to do it.
- [⭐⭐] Implement it using cuBLAS and compare its efficiency with the previous implementation.
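A sketch of one top-down BFS level, assuming the graph is stored in CSR (`row_offsets`, `col_indices`); each thread expands one frontier vertex and appends its unvisited neighbours to the next frontier:

```cpp
#include <cuda_runtime.h>

// One BFS level. `distances` doubles as the visited set (-1 = unvisited);
// atomicCAS ensures each vertex is claimed by exactly one thread, so no
// vertex is enqueued twice.
__global__ void bfs_top_down(const int* row_offsets, const int* col_indices,
                             const int* frontier, int frontier_size,
                             int* next_frontier, int* next_size,
                             int* distances, int level) {
  int t = blockIdx.x * blockDim.x + threadIdx.x;
  if (t >= frontier_size) return;
  int u = frontier[t];
  for (int e = row_offsets[u]; e < row_offsets[u + 1]; ++e) {
    int v = col_indices[e];
    if (atomicCAS(&distances[v], -1, level + 1) == -1) {
      next_frontier[atomicAdd(next_size, 1)] = v;  // claimed: enqueue v
    }
  }
}
// Host loop: launch once per level, swap the frontiers, and stop when
// *next_size comes back as 0.
```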
- Using cooperative groups (see `grid_floyd_cooperative_group.cu`), implement the two loops of the Brent-Kung prefix sum as a single kernel using grid-wide synchronization (a sketch of the synchronization pattern follows this list).
- Use shared memory in the Kogge-Stone algorithm (scan-v5).
- Use warp primitives to implement some elements of `find_repeat` in the prefix sum project (a sketch follows this list).
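For the grid-wide synchronization item, the cooperative-groups pattern looks roughly like this; the scan phases themselves are elided, and the kernel must be launched cooperatively for `grid.sync()` to be legal:

```cpp
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

// The whole scan in one kernel: grid.sync() replaces the kernel boundary
// between the up-sweep and down-sweep launches.
__global__ void brent_kung_scan(float* data, int n) {
  cg::grid_group grid = cg::this_grid();
  // ... up-sweep (reduction tree) over the whole array ...
  grid.sync();  // every block must finish the up-sweep first
  // ... down-sweep (distribution tree) ...
}

// Host side: a cooperative launch is required for grid.sync().
void launch(float* d_data, int n, int blocks, int threads) {
  void* args[] = { &d_data, &n };
  cudaLaunchCooperativeKernel((void*)brent_kung_scan, dim3(blocks),
                              dim3(threads), args, 0, nullptr);
}
```

For the warp-primitive item, assuming `find_repeat` marks the indices `i` with `A[i] == A[i+1]`, a shuffle lets each lane fetch its right neighbour's value without shared memory; this sketch covers only the comparison, not the final output compaction:

```cpp
__global__ void mark_repeats(const int* A, int n, int* flags) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int v = (i < n) ? A[i] : 0;
  // Lane l receives the value held by lane l+1.
  int next = __shfl_down_sync(0xffffffff, v, 1);
  int lane = threadIdx.x % 32;
  if (i < n - 1) {
    // The last lane of a warp has no in-warp right neighbour:
    // fall back to a global load for that one element.
    if (lane == 31) next = A[i + 1];
    flags[i] = (v == next);
  }
}
```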
Some exercises are inspired by the course Parallel Computing at Stanford and used with the permission of Kayvon Fatahalian. Thanks to him and his teaching team!