Modern computing systems are heterogeneous — they contain multiple types of processors working together. The most common configuration pairs a CPU (Central Processing Unit) with a GPU (Graphics Processing Unit).
┌─────────────────────────────────────────────────────────────┐
│ Heterogeneous Computing System │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────────────┐ │
│ │ CPU │ │ GPU │ │
│ │ (Host) │◄────────────►│ (Device) │ │
│ │ │ PCIe Bus │ │ │
│ │ 4-16 cores │ │ 1000s of cores │ │
│ │ Complex │ │ Simple but many │ │
│ │ Serial │ │ Parallel │ │
│ └─────────────┘ └─────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────────────┐ │
│ │ System RAM │ │ GPU Memory │ │
│ │ (Host) │ │ (Device) │ │
│ │ 8-64 GB │ │ 4-48 GB │ │
│ └─────────────┘ └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
| Aspect | CPU | GPU |
|---|---|---|
| Core Count | 4-64 cores | 1000-10000+ cores |
| Core Complexity | Complex, out-of-order | Simple, in-order |
| Clock Speed | 3-5 GHz | 1-2 GHz |
| Cache per Core | Large (MB) | Small (KB) |
| Optimized for | Latency | Throughput |
| Best at | Serial tasks | Parallel tasks |
GPUs were originally designed for rendering graphics, where the same operation (like calculating pixel color) must be applied to millions of pixels independently. This Single Instruction, Multiple Data (SIMD) pattern is common in many computational domains:
- Scientific Computing: Matrix operations, simulations
- Machine Learning: Neural network training and inference
- Signal Processing: FFT, convolution, filtering
- Computer Vision: Image processing, feature detection
- Finance: Monte Carlo simulations, risk analysis
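For instance, scaling every element of a vector applies one independent multiply per element. A minimal sketch of the pattern (the CUDA syntax used here is explained later in this guide; the kernel name scale is illustrative):

// CPU version: one loop visits each element in turn
void scaleCPU(float a, float* x, int n) {
    for (int i = 0; i < n; ++i) x[i] *= a;
}

// GPU version: one thread per element, all elements processed in parallel
__global__ void scale(float a, float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}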
CUDA (Compute Unified Device Architecture) is NVIDIA's platform for general-purpose GPU computing. It provides:
- Programming Model: Extensions to C/C++ for GPU programming
- Compiler: nvcc compiles CUDA C/C++ source (e.g., nvcc -o add add.cu)
- Runtime API: Functions for memory management and kernel launching
- Libraries: cuBLAS, cuDNN, cuFFT, etc.
| Term | Definition |
|---|---|
| Host | The CPU and its memory |
| Device | The GPU and its memory |
| Kernel | A function that runs on the GPU |
| Thread | Single execution unit on GPU |
| Block | Group of threads that can cooperate |
| Grid | Collection of all blocks for a kernel |
Let's examine the structure of a basic CUDA program. By convention, the h_ prefix marks host pointers and d_ marks device pointers:
// Kernel definition - runs on GPU
__global__ void addKernel(int* a, int* b, int* c) {
int idx = threadIdx.x;
c[idx] = a[idx] + b[idx];
}
int main() {
// 1. Declare host and device pointers
int h_a[3] = {1, 2, 3};
int h_b[3] = {4, 5, 6};
int h_c[3];
int *d_a, *d_b, *d_c;
// 2. Allocate device memory
cudaMalloc(&d_a, 3 * sizeof(int));
cudaMalloc(&d_b, 3 * sizeof(int));
cudaMalloc(&d_c, 3 * sizeof(int));
// 3. Copy data from host to device
cudaMemcpy(d_a, h_a, 3 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, 3 * sizeof(int), cudaMemcpyHostToDevice);
// 4. Launch kernel
addKernel<<<1, 3>>>(d_a, d_b, d_c);
// 5. Copy result from device to host
cudaMemcpy(h_c, d_c, 3 * sizeof(int), cudaMemcpyDeviceToHost);
// 6. Free device memory
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
return 0;
}

┌────────────────────────────────────────────────────────────┐
│ CUDA Program Flow │
└────────────────────────────────────────────────────────────┘
HOST (CPU) DEVICE (GPU)
────────── ────────────
│
▼
┌─────────────┐
│ Allocate │
│ host memory │
└─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐
│ cudaMalloc │ ──────────────────►│ Allocate │
│ │ │ device mem │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐
│ cudaMemcpy │ ═══════════════════►│ Data copied │
│ H → D │ PCIe Transfer │ to device │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐
│ Launch │ ──────────────────►│ Kernel │
│ kernel<<<>>>│ │ executes │
└─────────────┘ │ in parallel │
│ └─────────────┘
│ (CPU continues │
│ or waits) │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ cudaMemcpy │◄═══════════════════│ Results │
│ D → H │ PCIe Transfer │ ready │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐
│ cudaFree │ ──────────────────►│ Free device │
│ │ │ memory │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ Use results │
└─────────────┘
Kernel functions, which run on the GPU but are launched from host code, are marked with __global__:
__global__ void myKernel(int* data) {
// This code runs on the GPU
int idx = threadIdx.x;
data[idx] = data[idx] * 2;
}

| Qualifier | Executes on | Called from |
|---|---|---|
| __global__ | Device (GPU) | Host (CPU) |
| __device__ | Device | Device |
| __host__ | Host | Host |
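The __host__ and __device__ qualifiers can be combined so the compiler builds the function for both sides (note that __global__ functions must return void). A minimal sketch with an illustrative name:

// Compiled for both CPU and GPU; callable from either side
__host__ __device__ int clampToByte(int x) {
    return x < 0 ? 0 : (x > 255 ? 255 : x);
}

A __device__ helper, by contrast, runs only on the GPU and is called from kernels: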
__device__ int square(int x) {
return x * x; // Helper function on GPU
}
__global__ void kernel(int* data) {
data[threadIdx.x] = square(data[threadIdx.x]);
}

CUDA organizes parallel execution in a three-level hierarchy of grids, blocks, and threads:
┌─────────────────────────────────────────────────────────────┐
│ GRID │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ BLOCK │ │ BLOCK │ │ BLOCK │ │
│ │ (0,0) │ │ (1,0) │ │ (2,0) │ │
│ │ ┌─┬─┬─┬─┐ │ │ ┌─┬─┬─┬─┐ │ │ ┌─┬─┬─┬─┐ │ │
│ │ │T│T│T│T│ │ │ │T│T│T│T│ │ │ │T│T│T│T│ │ │
│ │ ├─┼─┼─┼─┤ │ │ ├─┼─┼─┼─┤ │ │ ├─┼─┼─┼─┤ │ │
│ │ │T│T│T│T│ │ │ │T│T│T│T│ │ │ │T│T│T│T│ │ │
│ │ └─┴─┴─┴─┘ │ │ └─┴─┴─┴─┘ │ │ └─┴─┴─┴─┘ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ BLOCK │ │ BLOCK │ │ BLOCK │ │
│ │ (0,1) │ │ (1,1) │ │ (2,1) │ │
│ │ ┌─┬─┬─┬─┐ │ │ ┌─┬─┬─┬─┐ │ │ ┌─┬─┬─┬─┐ │ │
│ │ │T│T│T│T│ │ │ │T│T│T│T│ │ │ │T│T│T│T│ │ │
│ │ ├─┼─┼─┼─┤ │ │ ├─┼─┼─┼─┤ │ │ ├─┼─┼─┼─┤ │ │
│ │ │T│T│T│T│ │ │ │T│T│T│T│ │ │ │T│T│T│T│ │ │
│ │ └─┴─┴─┴─┘ │ │ └─┴─┴─┴─┘ │ │ └─┴─┴─┴─┘ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Grid Dimensions: 3 × 2 blocks │
│ Block Dimensions: 4 × 2 threads │
│ Total Threads: 3 × 2 × 4 × 2 = 48 threads │
└─────────────────────────────────────────────────────────────┘
CUDA provides built-in variables to identify each thread:
| Variable | Type | Description |
|---|---|---|
| threadIdx | uint3 | Thread index within block |
| blockIdx | uint3 | Block index within grid |
| blockDim | dim3 | Dimensions of each block |
| gridDim | dim3 | Dimensions of the grid |
__global__ void kernel() {
// 1D indexing
int tid = threadIdx.x;
int bid = blockIdx.x;
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
// 2D indexing
int row = threadIdx.y;
int col = threadIdx.x;
int globalRow = blockIdx.y * blockDim.y + threadIdx.y;
int globalCol = blockIdx.x * blockDim.x + threadIdx.x;
}

Kernels are launched with an execution configuration that sets the grid and block dimensions:
kernel<<<gridDim, blockDim>>>(arguments);
kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(arguments);

// Launch 256 threads in a single block
kernel<<<1, 256>>>(data);
// Launch 1024 threads across 4 blocks
kernel<<<4, 256>>>(data);
// Process n elements
int threadsPerBlock = 256;
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
kernel<<<numBlocks, threadsPerBlock>>>(data, n);

// 2D block of threads
dim3 threadsPerBlock(16, 16); // 256 threads total
// 2D grid of blocks
dim3 numBlocks(
(width + 15) / 16,
(height + 15) / 16
);
kernel<<<numBlocks, threadsPerBlock>>>(data, width, height);

// 1D global index
__global__ void kernel1D(int* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
data[idx] = data[idx] * 2;
}
}
// 2D global index
__global__ void kernel2D(int* data, int width, int height) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (col < width && row < height) {
int idx = row * width + col; // Row-major layout
data[idx] = data[idx] * 2;
}
}

int* d_data;
// Allocate device memory
cudaMalloc(&d_data, n * sizeof(int));
// Use the memory...
// Free device memory
cudaFree(d_data);

// Host to Device
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
// Device to Host
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
// Device to Device
cudaMemcpy(d_dest, d_src, size, cudaMemcpyDeviceToDevice);

┌────────────────────────────────────────────────────────────┐
│ Memory Transfers │
└────────────────────────────────────────────────────────────┘
HOST DEVICE
┌─────────┐ ┌─────────┐
│ h_src │ ════════════════════► │ d_dst │
│ │ cudaMemcpyHostToDevice│ │
└─────────┘ └─────────┘
┌─────────┐ ┌─────────┐
│ h_dst │ ◄════════════════════ │ d_src │
│ │ cudaMemcpyDeviceToHost│ │
└─────────┘ └─────────┘
┌─────────┐
│ d_src │
│ ║ │
│ ▼ │
│ d_dst │
└─────────┘
cudaMemcpyDeviceToDevice
Always check for CUDA errors in production code:
cudaError_t err;
// Check malloc
err = cudaMalloc(&d_data, size);
if (err != cudaSuccess) {
printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
return -1;
}
// Check kernel launch
kernel<<<blocks, threads>>>(data);
err = cudaGetLastError();
if (err != cudaSuccess) {
printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
}
// Check memcpy
err = cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);
if (err != cudaSuccess) {
printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}

Note that kernel launches are asynchronous: an error during kernel execution is only reported at a later synchronization point, such as cudaDeviceSynchronize() or a blocking cudaMemcpy().

A GPU contains multiple Streaming Multiprocessors (SMs). Each SM can execute multiple thread blocks concurrently:
┌─────────────────────────────────────────────────────────────┐
│ GPU │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ SM 0 │ │ SM 1 │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ CUDA Cores │ │ │ │ CUDA Cores │ │ │
│ │ │ (64-128) │ │ │ │ (64-128) │ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ Shared Mem │ │ │ │ Shared Mem │ │ │
│ │ │ (48-164 KB) │ │ │ │ (48-164 KB) │ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ L1 Cache │ │ │ │ L1 Cache │ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │
│ │ ┌──────────────┐ │ │ ┌──────────────┐ │ │
│ │ │ Registers │ │ │ │ Registers │ │ │
│ │ │ (64K 32-bit) │ │ │ │ (64K 32-bit) │ │ │
│ │ └──────────────┘ │ │ └──────────────┘ │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ │
│ │ SM 2 │ │ SM n │ │
│ │ ... │ │ ... │ │
│ └──────────────────┘ └──────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ L2 Cache │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Global Memory (GDDR/HBM) │ │
│ │ (4-80 GB) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
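The SM count and per-device limits can be queried at runtime with cudaGetDeviceProperties; a minimal sketch:

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("%s: %d SMs, %zu bytes of global memory\n",
           prop.name, prop.multiProcessorCount, prop.totalGlobalMem);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}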
A warp is a group of 32 threads that execute together in lockstep. All threads in a warp execute the same instruction at the same time (SIMT - Single Instruction, Multiple Threads).
┌─────────────────────────────────────────────────────────────┐
│ Thread Block (128 threads) │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Warp 0: Threads 0-31 │ │
│  │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐     │   │
│  │ │ 0│ 1│ 2│ 3│ 4│ 5│       ...             │31│        │   │
│  │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘     │   │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Warp 1: Threads 32-63 │ │
│ │ ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐ │ │
│ │ │32│33│34│35│36│37│... │63│ │ │
│ │ └──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Warp 2: Threads 64-95 │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Warp 3: Threads 96-127 │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
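Inside a kernel, a thread's warp and its lane (position within the warp) follow directly from the thread index; a small sketch (warpInfo, warpIds, and laneIds are illustrative names):

__global__ void warpInfo(int* warpIds, int* laneIds) {
    int tid = threadIdx.x;
    warpIds[tid] = tid / 32;  // which warp of the block this thread belongs to
    laneIds[tid] = tid % 32;  // lane: position within the 32-thread warp
}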
When threads in a warp take different paths through conditional code, warp divergence occurs:
__global__ void kernel(int* data) {
int idx = threadIdx.x;
if (idx % 2 == 0) {
data[idx] = 1; // Even threads execute this
} else {
data[idx] = 2; // Odd threads execute this
}
// Both paths are serialized!
}

Warp Execution with Divergence:
───────────────────────────────
Time ──►

┌────────────────────┐      ┌────────────────────┐
│ if-branch          │      │ else-branch        │
│ (even threads      │      │ (odd threads       │
│  active)           │      │  active)           │
└────────────────────┘      └────────────────────┘
Threads 0,2,4,6...          Threads 1,3,5,7...
execute                     execute
Threads 1,3,5,7...          Threads 0,2,4,6...
wait (masked)               wait (masked)
Best Practice: Minimize divergence by grouping similar work together.
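Applied to the even/odd example above, divergence can be removed either by eliminating the branch or by making the condition uniform across each warp; a sketch (kernel names are illustrative):

// Option 1: branchless; every thread runs the same instructions
__global__ void branchless(int* data) {
    int idx = threadIdx.x;
    data[idx] = 1 + (idx & 1);  // 1 for even threads, 2 for odd
}

// Option 2: branch at warp granularity, so all 32 threads of a warp agree
__global__ void warpUniform(int* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if ((idx / 32) % 2 == 0) {
        data[idx] = 1;  // whole warps take this path together
    } else {
        data[idx] = 2;
    }
}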
The GPU scheduler assigns thread blocks to SMs as resources become available:
┌─────────────────────────────────────────────────────────────┐
│ Block Scheduling │
└─────────────────────────────────────────────────────────────┘
Grid with 8 blocks:
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │
└────┴────┴────┴────┴────┴────┴────┴────┘
GPU with 2 SMs (each can run 2 blocks):
Time 0:
SM0: [B0] [B1] SM1: [B2] [B3]
Time 1 (B0, B2 finish):
SM0: [B4] [B1] SM1: [B5] [B3]
Time 2 (B1, B3 finish):
SM0: [B4] [B6] SM1: [B5] [B7]
...and so on
Occupancy is the ratio of active warps to maximum possible warps on an SM. Higher occupancy can hide memory latency:
Occupancy = Active Warps / Maximum Warps
Factors limiting occupancy:
- Registers per thread
- Shared memory per block
- Threads per block
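The CUDA runtime can estimate occupancy for a given kernel and block size; a minimal sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor (the kernel busyKernel is illustrative):

#include <cstdio>

__global__ void busyKernel(float* out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main() {
    int blockSize = 256;
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, busyKernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int activeWarps = blocksPerSM * blockSize / 32;       // warps resident per SM
    int maxWarps = prop.maxThreadsPerMultiProcessor / 32; // hardware limit
    printf("Theoretical occupancy: %.0f%%\n", 100.0 * activeWarps / maxWarps);
    return 0;
}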
The __syncthreads() barrier synchronizes all threads within a block:
__global__ void kernel() {
__shared__ int data[256];
// Phase 1: All threads write
data[threadIdx.x] = threadIdx.x;
__syncthreads(); // Wait for all writes to complete
// Phase 2: All threads read (now safe)
int value = data[255 - threadIdx.x];
}

Warning: All threads in the block must reach __syncthreads(). Conditional usage can cause deadlock:
// DANGEROUS - can cause deadlock!
if (threadIdx.x < 16) {
__syncthreads(); // Only some threads call this
}
// CORRECT
__syncthreads(); // All threads call this
if (threadIdx.x < 16) {
// Do conditional work
}

Atomic operations provide thread-safe updates to shared or global memory:
__global__ void histogram(int* data, int* bins, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
atomicAdd(&bins[data[idx]], 1);
}
}

Common atomic operations:
- atomicAdd(), atomicSub()
- atomicMax(), atomicMin()
- atomicAnd(), atomicOr(), atomicXor()
- atomicExch(), atomicCAS()
┌─────────────────────────────────────────────────────────────┐
│ CUDA Memory Hierarchy │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Per-Thread │ │
│ │ ┌─────────────┐ ┌─────────────────────────────┐ │ │
│ │ │ Registers │ │ Local Memory │ │ │
│ │ │ (Fastest) │ │ (Slow - in global) │ │ │
│ │ │ ~1 cycle │ │ For spilled registers │ │ │
│ │ └─────────────┘ └─────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Per-Block │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Shared Memory │ │ │
│ │ │ ~5 cycles │ │ │
│ │ │ 48-164 KB per SM │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Per-Device │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌────────────┐ │ │
│ │ │ Global Mem │ │ Constant │ │ Texture │ │ │
│ │ │ ~400 cycles │ │ Memory │ │ Memory │ │ │
│ │ │ 4-80 GB │ │ 64 KB │ │ (Cached) │ │ │
│ │ │ R/W │ │ Read-only │ │ Read-only │ │ │
│ │ └─────────────┘ └─────────────┘ └────────────┘ │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
| Memory Type | Location | Scope | Lifetime | Speed | Size |
|---|---|---|---|---|---|
| Register | On-chip | Thread | Thread | ~1 cycle | ~256 KB/SM |
| Shared | On-chip | Block | Block | ~5 cycles | 48-164 KB/SM |
| Global | Off-chip | Grid | Application | ~400 cycles | 4-80 GB |
| Constant | Off-chip (cached) | Grid | Application | ~5 cycles* | 64 KB |
| Local | Off-chip | Thread | Thread | ~400 cycles | - |
*When cached; ~400 cycles on cache miss
Global memory characteristics:
- Largest memory space on GPU
- Slowest access (~400-800 cycles)
- Accessible by all threads in all blocks
- Persists for lifetime of application
- Allocated with cudaMalloc(), freed with cudaFree()
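Because an allocation persists across kernel launches, one kernel's output can feed the next with no round-trip to the host; a minimal sketch (kernel names are illustrative):

__global__ void produce(int* buf) { buf[threadIdx.x] = threadIdx.x; }
__global__ void consume(int* buf) { buf[threadIdx.x] *= 10; }

void pipeline() {
    int* d_buf;
    cudaMalloc(&d_buf, 256 * sizeof(int));
    produce<<<1, 256>>>(d_buf);  // results stay in global memory...
    consume<<<1, 256>>>(d_buf);  // ...and are visible to the next launch
    cudaFree(d_buf);
}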
For best performance, adjacent threads should access adjacent memory locations:
// GOOD: Coalesced access
// Threads 0,1,2,3 access data[0,1,2,3]
__global__ void coalesced(int* data) {
int idx = threadIdx.x;
data[idx] = data[idx] + 1;
}
// BAD: Strided access
// Threads 0,1,2,3 access data[0,32,64,96]
__global__ void strided(int* data) {
int idx = threadIdx.x * 32; // Stride of 32
data[idx] = data[idx] + 1;
}

Coalesced Access (GOOD):
Thread: 0 1 2 3 4 5 6 7
│ │ │ │ │ │ │ │
Memory: ┌─┬──┬──┬──┬──┬──┬──┬──┐
│0│ 1│ 2│ 3│ 4│ 5│ 6│ 7│
└─┴──┴──┴──┴──┴──┴──┴──┘
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲
└────┴────┴────┴────┴────┴────┴────┘
ONE memory transaction
Strided Access (BAD):
Thread: 0 1 2
│ │ │
Memory: ┌──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┐
│ 0│ │ │ │ │32│ │ │ │ │64│ │ │
└──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┴──┘
▲ ▲ ▲
└──────────────┴──────────────┘
MULTIPLE memory transactions
// 2D array stored in 1D memory (row-major)
// Element at (row, col) = array[row * width + col]
__global__ void process2D(int* data, int width, int height) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (col < width && row < height) {
int idx = row * width + col; // Row-major index
data[idx] = data[idx] * 2;
}
}

Shared memory characteristics:
- Fast on-chip memory (~5 cycles)
- Shared by all threads in a block
- Limited size (48-164 KB per SM)
- Lifetime of thread block
- Must be explicitly managed
__global__ void kernel() {
__shared__ int sharedData[256]; // Static: size known at compile time
sharedData[threadIdx.x] = threadIdx.x;
__syncthreads();
// Use shared data...
}

__global__ void kernel(int* data, int n) {
    // Declared extern and without a size; the size is supplied at launch time
    extern __shared__ int dynamicShared[];
    dynamicShared[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    // ...
}

// Launch with the shared memory size as the third configuration argument
kernel<<<blocks, threads, n * sizeof(int)>>>(data, n);

A classic use of shared memory is parallel reduction:

__global__ void reduce(int* input, int* output, int n) {
__shared__ int sdata[256];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Load from global to shared memory
sdata[tid] = (idx < n) ? input[idx] : 0;
__syncthreads();
// Reduction in shared memory
for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
if (tid < stride) {
sdata[tid] += sdata[tid + stride];
}
__syncthreads();
}
// Write result
if (tid == 0) {
output[blockIdx.x] = sdata[0];
}
}

Shared memory is divided into 32 banks. Simultaneous access to different addresses in the same bank causes bank conflicts:
Shared Memory Banks (32 banks):
┌────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B1 │ B2 │ B3 │ B4 │...│B30 │B31 │
├────┼────┼────┼────┼────┼────┼────┼────┤
│ 0 │ 1 │ 2 │ 3 │ 4 │...│ 30 │ 31 │ ◄─ Address 0-31
├────┼────┼────┼────┼────┼────┼────┼────┤
│ 32 │ 33 │ 34 │ 35 │ 36 │...│ 62 │ 63 │ ◄─ Address 32-63
├────┼────┼────┼────┼────┼────┼────┼────┤
│ 64 │ 65 │ 66 │ 67 │ 68 │...│ 94 │ 95 │ ◄─ Address 64-95
└────┴────┴────┴────┴────┴────┴────┴────┘
No conflict: 2-way conflict:
Thread 0 → Bank 0 Thread 0 → Bank 0 (addr 0)
Thread 1 → Bank 1 Thread 1 → Bank 0 (addr 32) ← Same bank!
Thread 2 → Bank 2 Thread 2 → Bank 1
... ...
// Potential bank conflict with stride of 32
__shared__ float data[32][32];
float val = data[threadIdx.x][0]; // All threads access same bank!
// Solution: Add padding
__shared__ float data[32][33]; // 33 instead of 32
float val = data[threadIdx.x][0]; // Now different banks

Constant memory characteristics:
- 64 KB total
- Read-only from device code
- Cached - very fast when all threads read same address
- Broadcast capability - one read serves all threads
- Best for data that doesn't change during kernel execution
// Declare in global scope
__constant__ float constData[256];
// Initialize from host
float hostData[256] = {...};
cudaMemcpyToSymbol(constData, hostData, sizeof(hostData));
// Use in kernel
__global__ void kernel(float* output) {
int idx = threadIdx.x;
    output[idx] = constData[idx] * 2.0f; // Read from constant memory
}

Typical uses for constant memory:
- Lookup tables
- Kernel parameters
- Coefficients for convolution
- Configuration values
// Example: Convolution kernel in constant memory
__constant__ float kernel[9]; // 3x3 convolution kernel
__global__ void convolve(float* input, float* output, int width, int height) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (col > 0 && col < width-1 && row > 0 && row < height-1) {
float sum = 0.0f;
for (int i = -1; i <= 1; i++) {
for (int j = -1; j <= 1; j++) {
sum += input[(row+i)*width + (col+j)] * kernel[(i+1)*3 + (j+1)];
}
}
output[row*width + col] = sum;
}
}

┌─────────────────────────────────────────────────────────────┐
│ Memory Selection Guide │
└─────────────────────────────────────────────────────────────┘
┌───────────────────┐
│ What data? │
└─────────┬─────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Per-thread│ │ Shared │ │ Global │
│ data │ │ data │ │ data │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Registers │ │ Shared │ │ Global │
│ │ │ Memory │ │ Memory │
└───────────┘ └───────────┘ └───────────┘
│
▼
┌───────────────────┐
│ Read-only across │
│ all threads? │
└─────────┬─────────┘
│
┌─────────┴─────────┐
│ Yes │ No
▼ ▼
┌───────────┐ ┌───────────┐
│ Constant │ │ Keep in │
│ Memory │ │ Shared │
└───────────┘ └───────────┘
Element-wise operations are a perfect fit for the GPU: no communication is needed between threads.
// Vector operations, ReLU, scaling, etc.
__global__ void elementwise(float* A, float* B, float* C, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
C[idx] = A[idx] + B[idx];
}
}

Reduction combines all elements into a single value, using shared memory for the intermediate sums.
__global__ void reduce(float* input, float* output, int n) {
__shared__ float sdata[256];
int tid = threadIdx.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = (idx < n) ? input[idx] : 0;
__syncthreads();
for (int s = blockDim.x/2; s > 0; s >>= 1) {
if (tid < s) sdata[tid] += sdata[tid + s];
__syncthreads();
}
if (tid == 0) output[blockIdx.x] = sdata[0];
}

Each block writes one partial sum to output; a second launch (or a small host-side loop) combines the per-block results.

Stencil computations use shared memory to cache neighboring elements.
#define BLOCK_SIZE 256 // Must match blockDim.x at launch

__global__ void stencil1D(float* in, float* out, int n) {
    __shared__ float s[BLOCK_SIZE + 2]; // Include halo
int gidx = blockIdx.x * blockDim.x + threadIdx.x;
int lidx = threadIdx.x + 1; // Leave room for halo
// Load main data
s[lidx] = in[gidx];
// Load halo elements
if (threadIdx.x == 0 && gidx > 0)
s[0] = in[gidx - 1];
if (threadIdx.x == blockDim.x - 1 && gidx < n - 1)
s[lidx + 1] = in[gidx + 1];
__syncthreads();
// Compute stencil
if (gidx > 0 && gidx < n - 1)
out[gidx] = 0.25f * s[lidx-1] + 0.5f * s[lidx] + 0.25f * s[lidx+1];
}

Matrix operations use 2D indexing; for large matrices, tiled algorithms improve data reuse.
__global__ void matrixAdd(float* A, float* B, float* C, int width, int height) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (col < width && row < height) {
int idx = row * width + col;
C[idx] = A[idx] + B[idx];
}
}
// Launch
dim3 threads(16, 16);
dim3 blocks((width+15)/16, (height+15)/16);
matrixAdd<<<blocks, threads>>>(A, B, C, width, height);

Do:
- Maximize parallelism: Use many threads
- Coalesce memory access: Adjacent threads → adjacent memory
- Use shared memory: For data reuse within a block
- Minimize divergence: Group similar work together
- Overlap computation and memory: Use streams (see the sketch after these lists)
- Don't transfer unnecessarily: Minimize host↔device transfers
- Don't use small blocks: At least 128-256 threads per block
- Don't ignore occupancy: More active warps hide latency
- Don't serialize with atomics: Use reduction patterns instead
- Don't forget bounds checking: Always verify index < size
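A rough sketch of overlapping transfers with computation using two streams, as referenced above (assumes h_data was allocated with cudaHostAlloc, since truly asynchronous copies need pinned host memory; the kernel scale and all names are illustrative):

__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void processInChunks(float* h_data, int n) {
    int half = n / 2;
    float *d_a, *d_b;
    cudaMalloc(&d_a, half * sizeof(float));
    cudaMalloc(&d_b, half * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Work queued on s1 can overlap with work queued on s2
    cudaMemcpyAsync(d_a, h_data, half * sizeof(float), cudaMemcpyHostToDevice, s1);
    scale<<<(half + 255) / 256, 256, 0, s1>>>(d_a, half);
    cudaMemcpyAsync(h_data, d_a, half * sizeof(float), cudaMemcpyDeviceToHost, s1);

    cudaMemcpyAsync(d_b, h_data + half, half * sizeof(float), cudaMemcpyHostToDevice, s2);
    scale<<<(half + 255) / 256, 256, 0, s2>>>(d_b, half);
    cudaMemcpyAsync(h_data + half, d_b, half * sizeof(float), cudaMemcpyDeviceToHost, s2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
}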
Quick reference:

// Allocation
cudaMalloc(&d_ptr, size);
cudaFree(d_ptr);
// Transfer
cudaMemcpy(dst, src, size, direction);
// direction: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost
// Constant memory
cudaMemcpyToSymbol(constVar, hostPtr, size);

// Kernel launch
kernel<<<numBlocks, threadsPerBlock>>>(args);
kernel<<<numBlocks, threadsPerBlock, sharedMemSize>>>(args);
// 2D
dim3 blocks(bx, by);
dim3 threads(tx, ty);
kernel<<<blocks, threads>>>(args);

// 1D
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// 2D
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int idx = row * width + col; // Row-major
// 2D block-local
int localIdx = threadIdx.y * blockDim.x + threadIdx.x;

__syncthreads(); // Block-level sync
atomicAdd(&var, val); // Atomic operation

This guide covered the essential concepts of CUDA programming:
- Heterogeneous Computing: CPU+GPU working together
- Programming Model: Kernels, threads, blocks, grids
- Execution Model: Warps, SMs, scheduling
- Memory Hierarchy: Global, shared, constant, registers
- Common Patterns: Element-wise, reduction, stencil, matrix
With these fundamentals, you're ready to solve problems on the CUDA Online Judge!
This document is part of the CUDA Online Judge educational materials.