A comprehensive benchmark for comparing matrix multiplication and transpose performance across different GPU architectures and implementation methods.
- Multiple matrix multiplication implementations (the naive and shared-memory kernels are sketched right after this list):
  - Naive CUDA implementation
  - Shared memory optimization
  - Tensor cores (WMMA) implementation
  - cuBLAS implementation
  - CUTLASS implementation
- Multiple matrix transpose implementations:
  - Naive, shared memory, swizzled, and vectorized CUDA implementations
  - cuBLAS transpose (SGEAM)
  - CuTe-based transpose kernels (Naive, Shared, Swizzled)
- Cross-GPU architecture testing (RTX 2080 Ti, A100, H100)
- Various matrix size testing:
  - Small (32×32)
  - Medium (1024×1024)
  - Large (8192×8192)
  - Non-square (1024×2048×1024)
- Performance measurements in GFLOPs for multiplications
- Performance measurements in GB/s for transpose
- Automated chart generation
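For orientation, here is a minimal sketch of what the first two multiplication variants typically look like. The kernel names, tile size, and launch configuration below are illustrative assumptions; the benchmarked kernels live in profiling/matrix_multiplication.cu and may differ in detail.

```cpp
// Naive GEMM baseline: one thread per output element of C = A * B
// (row-major). Every operand read goes straight to global memory.
__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Shared-memory variant: each block stages TILE x TILE sub-tiles of A and B
// in shared memory, so each global value is loaded once per tile instead of
// once per output element.
constexpr int TILE = 16;

__global__ void matmul_shared(const float* A, const float* B, float* C,
                              int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N)
        C[row * N + col] = acc;
}
```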
```bash
# Basic compilation
nvcc -o matrix_multiplication ./profiling/matrix_multiplication.cu -lcublas -lcusparse -O3 -arch=sm_70

# For CUTLASS support
nvcc -o matrix_multiplication ./profiling/matrix_multiplication.cu -lcublas -lcusparse -I/path/to/cutlass/include -O3 -arch=sm_70
```

```bash
# Run on all available GPUs
./matrix_multiplication
# Run on a specific GPU type
./matrix_multiplication --gpu "RTX 2080 Ti"
./matrix_multiplication --gpu "A100"
./matrix_multiplication --gpu "H100"# Run only on the first GPU in the system
```bash
# Run only on the first GPU in the system
CUDA_VISIBLE_DEVICES=0 ./matrix_multiplication
# Run only on the second GPU
CUDA_VISIBLE_DEVICES=1 ./matrix_multiplication
```

```bash
# SSH to machine with RTX 2080 Ti
ssh rtx2080ti-node "cd /path/to/benchmark && ./matrix_multiplication"
# SSH to machine with A100
ssh a100-node "cd /path/to/benchmark && ./matrix_multiplication"
# SSH to machine with H100
ssh h100-node "cd /path/to/benchmark && ./matrix_multiplication"
```

```bash
# Submit the job to the Slurm queue
sbatch run_benchmark.sh
# Use Slurm job arrays to run on multiple GPU types
sbatch --array=0-2 run_benchmark.sh
```

To generate charts from the results:

```bash
python3 generate_charts.py matrix_mul_performance.csv
```

Charts will be saved to the charts/ directory, showing performance comparisons across GPU architectures, matrix sizes, and implementation methods.
The repository includes a full CUTLASS implementation that demonstrates how to leverage NVIDIA's CUTLASS library for high-performance matrix multiplication. To use the CUTLASS implementation:
- Make sure you have the CUTLASS library installed
- Include the CUTLASS headers in your compilation
- The implementation supports both standard GEMM operations and tensor core operations when available (a minimal usage sketch follows this list)
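As a rough orientation (not the repository's exact code), the device-level CUTLASS GEMM API is used along these lines; the element types, layouts, and defaults below are assumptions, and the tensor-core path would swap in half-precision element types plus an architecture/op-class tag:

```cpp
#include <cutlass/gemm/device/gemm.h>

// Single-precision, column-major GEMM using CUTLASS's device-level API:
// D = alpha * A * B + beta * C. Tile shapes and epilogue are left at their
// library defaults in this sketch.
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C / D

cutlass::Status run_cutlass_sgemm(int M, int N, int K,
                                  const float* d_A, int lda,
                                  const float* d_B, int ldb,
                                  float* d_C, int ldc,
                                  float alpha, float beta) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},   // problem size
                         {d_A, lda},  // A
                         {d_B, ldb},  // B
                         {d_C, ldc},  // C (read when beta != 0)
                         {d_C, ldc},  // D (output)
                         {alpha, beta});
    return gemm_op(args);  // launches on the current CUDA stream
}
```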
The benchmark produces a CSV file with performance results that includes:
- GPU name
- Implementation type
- Matrix dimensions
- Execution time (ms)
- Throughput (GFLOPs; see the note after this list)
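The GFLOPs column follows the standard GEMM operation count (2·M·N·K floating-point operations); the exact bookkeeping is defined in the benchmark source, but the conversion is essentially:

```cpp
// Conventional GEMM throughput: each of the M*N outputs takes K multiplies
// and K additions, i.e. 2*M*N*K FLOPs in total.
double gemm_gflops(long long M, long long N, long long K, double time_ms) {
    double flops = 2.0 * M * N * K;
    return flops / (time_ms * 1e6);  // ms -> s and FLOPs -> GFLOPs
}
```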
The chart generation script visualizes these results to show:
- Relative performance across GPU architectures
- Performance scaling with matrix size
- Comparison between different implementation methods
Charts will be saved to the charts/ directory, showing performance comparisons across:
- Different GPU architectures (RTX 2080 Ti, A100, H100)
- Various matrix sizes (32×32, 1024×1024, 8192×8192, 1024×2048)
- All implementation methods (Naive, Shared Memory, Tensor Cores, cuBLAS, CUTLASS)
Before building, ensure you have the CUTLASS library downloaded and available in ~/libraries/cutlass/include.
Manually compile the transpose binary per GPU architecture:
```bash
# RTX 2080 Ti (sm_75)
nvcc -o scripts/matrix_transpose_RTX2080 profiling/matrix_transpose.cu -lcublas -I$HOME/libraries/cutlass/include -O3 -arch=sm_75

# A100 (sm_80)
nvcc -o scripts/matrix_transpose_A100 profiling/matrix_transpose.cu -lcublas -I$HOME/libraries/cutlass/include -O3 -arch=sm_80

# H100 (sm_90)
nvcc -o scripts/matrix_transpose_H100 profiling/matrix_transpose.cu -lcublas -I$HOME/libraries/cutlass/include -O3 -arch=sm_90
```

```bash
# Run specific transpose binary based on GPU
./scripts/matrix_transpose_RTX2080
./scripts/matrix_transpose_A100
./scripts/matrix_transpose_H100
```

Submit jobs using job arrays:

```bash
# Submit all 3 jobs (RTX2080, A100, H100)
sbatch --array=0-2 scripts/run_transpose.slurm
```

Or target a specific GPU:

```bash
# For RTX 2080 Ti (array ID 0)
sbatch --gres=gpu:nvidia_geforce_rtx_2080_ti:1 --array=0 scripts/run_transpose.slurm
# For A100 (array ID 1)
sbatch --gres=gpu:nvidia_a100-pcie-40gb:1 --array=1 scripts/run_transpose.slurm
# For H100 (array ID 2)
sbatch --gres=gpu:nvidia_h100_nvl:1 --array=2 scripts/run_transpose.slurm
```

Note: If the H100 is unavailable, the job will remain pending. Results will be added once the node is accessible.
Each GPU run generates a file like:
```
results/transpose/matrix_transpose_<GPU>_performance.csv
```
Each row includes:
- GPU name
- Implementation (Naive, Shared, Swizzled, etc.)
- Matrix size (MxN)
- Execution time (ms)
- Throughput (GB/s; see the note after this list)
- Verification status
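For context, this is a sketch of the classic shared-memory transpose (tile with padding to avoid bank conflicts) together with the usual read-plus-write bandwidth convention behind the GB/s column. The kernels actually measured are in profiling/matrix_transpose.cu and may differ in naming and tile size.

```cpp
// Tiled transpose through shared memory; the +1 padding on the inner tile
// dimension avoids shared-memory bank conflicts on the transposed read.
// Effective bandwidth is typically reported as bytes read + bytes written:
//   GB/s = 2 * rows * cols * sizeof(float) / (time_ms * 1e6)
constexpr int TDIM = 32;

__global__ void transpose_shared(const float* __restrict__ in,
                                 float* __restrict__ out,
                                 int rows, int cols) {
    __shared__ float tile[TDIM][TDIM + 1];

    int x = blockIdx.x * TDIM + threadIdx.x;   // column in the input
    int y = blockIdx.y * TDIM + threadIdx.y;   // row in the input
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];
    __syncthreads();

    // Write the tile back transposed: input block (bx, by) becomes output
    // block (by, bx).
    int tx = blockIdx.y * TDIM + threadIdx.x;  // column in the output
    int ty = blockIdx.x * TDIM + threadIdx.y;  // row in the output
    if (tx < rows && ty < cols)
        out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];
}
```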