Formulas for computing a unique global thread index, for each combination of grid and block dimensionality:

- grid 1D, block 1D:
  int threadId = blockIdx.x * blockDim.x + threadIdx.x;
- grid 1D, block 2D:
  int threadId = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x;
- grid 1D, block 3D:
  int threadId = blockIdx.x * blockDim.x * blockDim.y * blockDim.z + threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x + threadIdx.x;
- grid 2D, block 1D:
  int blockId = blockIdx.y * gridDim.x + blockIdx.x;
  int threadId = blockId * blockDim.x + threadIdx.x;
- grid 2D, block 2D:
  int blockId = blockIdx.x + blockIdx.y * gridDim.x;
  int threadId = blockId * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;
- grid 2D, block 3D:
  int blockId = blockIdx.x + blockIdx.y * gridDim.x;
  int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x;
- grid 3D, block 1D:
  int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
  int threadId = blockId * blockDim.x + threadIdx.x;
- grid 3D, block 2D:
  int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
  int threadId = blockId * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x;
- grid 3D, block 3D:
  int blockId = blockIdx.x + blockIdx.y * gridDim.x + gridDim.x * gridDim.y * blockIdx.z;
  int threadId = blockId * (blockDim.x * blockDim.y * blockDim.z) + (threadIdx.z * (blockDim.x * blockDim.y)) + (threadIdx.y * blockDim.x) + threadIdx.x;
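To make the pattern concrete, here is a minimal sketch (hypothetical kernel and variable names) that uses the grid 2D, block 2D formula so that each thread writes its own unique index into a flat array:

    #include <cstdio>

    // Each thread computes its unique global index using the
    // "grid 2D, block 2D" formula above, then tags one array element.
    __global__ void tagThreads(int *out, int n)
    {
        int blockId  = blockIdx.x + blockIdx.y * gridDim.x;
        int threadId = blockId * (blockDim.x * blockDim.y)
                     + (threadIdx.y * blockDim.x) + threadIdx.x;
        if (threadId < n)                 // guard against stray threads
            out[threadId] = threadId;
    }

    int main()
    {
        const int n = 1024;               // 4 blocks of 256 threads
        int *out;
        cudaMallocManaged(&out, n * sizeof(int));

        dim3 block(16, 16);               // 2D block: 256 threads
        dim3 grid(2, 2);                  // 2D grid: 4 blocks
        tagThreads<<<grid, block>>>(out, n);
        cudaDeviceSynchronize();

        printf("out[0]=%d, out[%d]=%d\n", out[0], n - 1, out[n - 1]);
        cudaFree(out);
        return 0;
    }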
Accelerated computing is replacing CPU-only computing as best practice. The parade of breakthroughs driven by accelerated computing, the ever-increasing demand for accelerated applications, programming conventions that ease writing them, and constant improvements in the hardware that supports them are all driving this inevitable transition.
At the center of accelerated computing's success, both in terms of its impressive performance and its ease of use, is the CUDA compute platform. CUDA provides a coding paradigm that extends languages like C, C++, Python, and Fortran to run accelerated, massively parallelized code on the world's most performant parallel processors: NVIDIA GPUs. CUDA accelerates applications drastically with little effort, has an ecosystem of highly optimized libraries for DNN, BLAS, graph analytics, FFT, and more, and ships with powerful command-line and visual profilers.
CUDA supports many, if not most, of the world's most performant applications in Computational Fluid Dynamics, Molecular Dynamics, Quantum Chemistry, Physics, and HPC.
Learning CUDA will enable you to accelerate your own applications. Accelerated applications perform much faster than their CPU-only counterparts, and make possible computations that would otherwise be prohibitively slow given the limited performance of CPU-only applications. In this lab you will receive an introduction to programming accelerated applications with CUDA C/C++, enough to be able to begin work accelerating your own CPU-only applications for performance gains, and for moving into novel computational territory.
By the time you complete this lab, you will be able to:
- Write, compile, and run C/C++ programs that both call CPU functions and launch GPU kernels.
- Control parallel thread hierarchy using execution configuration.
- Refactor serial loops to execute their iterations in parallel on a GPU.
- Allocate and free memory available to both CPUs and GPUs.
- Handle errors generated by CUDA code.
- Accelerate CPU-only applications.
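To preview how these pieces fit together, here is a minimal sketch (hypothetical names, not the lab's exercise code) that launches a kernel with an execution configuration, refactors a serial doubling loop into a grid-stride loop, allocates memory visible to both the CPU and the GPU, and handles errors:

    #include <cstdio>

    // Grid-stride loop: each thread handles every Nth element, so the
    // kernel is correct for any grid/block configuration.
    __global__ void doubleElements(int *a, int n)
    {
        int idx    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = idx; i < n; i += stride)
            a[i] *= 2;
    }

    int main()
    {
        const int n = 1000;
        int *a;
        cudaMallocManaged(&a, n * sizeof(int));   // visible to CPU and GPU
        for (int i = 0; i < n; ++i) a[i] = i;     // initialize on the CPU

        doubleElements<<<32, 256>>>(a, n);        // execution configuration

        cudaError_t err = cudaGetLastError();     // catch launch errors
        if (err != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(err));
        err = cudaDeviceSynchronize();            // catch async runtime errors
        if (err != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(err));

        printf("a[999] = %d\n", a[999]);          // expect 1998
        cudaFree(a);
        return 0;
    }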
With the techniques and tools you have learned now at your disposal, you are almost ready to start accelerating your own real-world applications. This section provides details on:
- Setting up your own CUDA-enabled environment
- How best to continue your accelerated programming learning
- An additional practice problem
- Other helpful resources
The two easiest ways for you to set up a CUDA-enabled environment for your own work are:
- Via a cloud provider
- Installing CUDA on your own system with an NVIDIA GPU
All major cloud providers offer NVIDIA GPU-enabled instances. A simple web search for “NVIDIA GPU” along with the name of your cloud provider of choice will turn up instructions for setting one up. These instances typically have the CUDA toolkit installed, so you can simply SSH in and get to work.
If you have access to a system with an NVIDIA GPU, but have not yet installed the CUDA toolkit, simply follow the directions here for downloading and installing CUDA on your particular operating system.
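Once installed, a quick sanity check (assuming a standard installation with the toolkit on your PATH) is to run "nvcc --version" to confirm the compiler responds, and "nvidia-smi" to confirm the driver can see your GPU.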
After setting up your own accelerated system, the single best thing you can do to further your development as an accelerated computing programmer is to work on accelerating your own applications. You have learned how to approach accelerated computing iteratively, and in a profile-driven manner, so:
- Take some baseline measurements of a compute-intensive application you work on
- Make some hypotheses about where you might accelerate it
- Make some naive changes
- Profile and repeat
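For the baseline measurements, you can profile the whole application from the command line (for example, Nsight Systems' "nsys profile --stats=true ./myapp", assuming nsys is installed), or time a specific region with CUDA events. Here is a minimal sketch of the latter, with a hypothetical stand-in kernel:

    #include <cstdio>

    // Hypothetical stand-in for a compute-intensive kernel you want to baseline.
    __global__ void step(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * 1.0001f + 0.5f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *x;
        cudaMallocManaged(&x, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        step<<<(n + 255) / 256, 256>>>(x, n);   // the region being measured
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);             // wait until "stop" has occurred

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("baseline: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(x);
        return 0;
    }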
Even though you are ready to accelerate CPU-only applications in ways that will meaningfully improve their performance, you can take the study of accelerated computing much further, and in time, you should.
The CUDA C Best Practices Guide is an essential resource for effective CUDA programming. After accelerating your own application in the ways you already know, start a study of this document, applying the techniques it describes to further improve your application’s performance.
By far the best practice is to accelerate your own applications, but for those of you who might not yet have a real-world use case, try your hand at accelerating the following Mandelbrot Set simulator. As usual, take an iterative and profile-driven approach.
- Mandelbrot Set Simulator: this C++ simulation includes a link to a detailed explanation of the application, and will let you see the impact of GPU acceleration visually
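If you would like a feel for how such a simulator parallelizes before diving in, here is a minimal sketch (not the linked project's actual code) of an escape-time Mandelbrot kernel that assigns one thread per pixel:

    #include <cstdio>

    // One thread per pixel: map the pixel to a point c in the complex
    // plane, then run the escape-time iteration z = z^2 + c.
    __global__ void mandelbrot(int *iters, int width, int height, int maxIter)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        float cr = -2.0f + 3.0f * x / width;    // real axis: [-2, 1]
        float ci = -1.5f + 3.0f * y / height;   // imaginary axis: [-1.5, 1.5]
        float zr = 0.0f, zi = 0.0f;
        int i = 0;
        while (i < maxIter && zr * zr + zi * zi < 4.0f) {
            float tmp = zr * zr - zi * zi + cr;
            zi = 2.0f * zr * zi + ci;
            zr = tmp;
            ++i;
        }
        iters[y * width + x] = i;               // iteration count colors the pixel
    }

    int main()
    {
        const int w = 256, h = 256, maxIter = 256;
        int *iters;
        cudaMallocManaged(&iters, w * h * sizeof(int));

        dim3 block(16, 16);
        dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
        mandelbrot<<<grid, block>>>(iters, w, h, maxIter);
        cudaDeviceSynchronize();

        printf("iterations at center pixel: %d\n", iters[(h / 2) * w + w / 2]);
        cudaFree(iters);
        return 0;
    }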
Many talented programmers have used CUDA to create highly optimized libraries for accelerated computing. There are many scenarios in your own applications where you will need to write your own CUDA code, but, as usual in programming, there are also many where someone else has already written the code for you.
Peruse GPU-Accelerated Libraries for Computing to learn where you can use highly optimized CUDA libraries for tasks like basic linear algebra solvers (BLAS), graph analytics, fast Fourier transforms (FFT), random number generation (RNG), and image and signal processing, to name a few.
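As a taste of what using such a library looks like, here is a minimal sketch that computes y = alpha * x + y with cuBLAS's SAXPY routine rather than a hand-written kernel (link with -lcublas; the array contents are illustrative):

    #include <cstdio>
    #include <cublas_v2.h>

    int main()
    {
        const int n = 1000;
        const float alpha = 2.0f;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 3.0f; }

        cublasHandle_t handle;
        cublasCreate(&handle);                       // set up the library context
        cublasSaxpy(handle, n, &alpha, x, 1, y, 1);  // y = alpha * x + y, on the GPU
        cudaDeviceSynchronize();                     // wait for the result

        printf("y[0] = %.1f\n", y[0]);               // expect 5.0
        cublasDestroy(handle);
        cudaFree(x);
        cudaFree(y);
        return 0;
    }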
This is a quick-start for users who just want to get going.
- Use the NVIDIA AMI on AWS (10 minutes): Deploy on Amazon EC2
- Get started with nvidia-docker (5 minutes): nvidia-docker
- Get started with the CUDA development image (5 minutes): "docker pull nvidia/cuda:9.1-devel"
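Once the image is pulled, you can (assuming nvidia-docker is installed and working) drop into a container with the full CUDA toolchain via "nvidia-docker run --rm -it nvidia/cuda:9.1-devel", then run "nvcc --version" inside it to confirm the compiler is available.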