This repository collects CUDA optimization techniques for matrix multiplication and parallel reduction.
Each technique is demonstrated through a practical implementation and accompanying performance analysis.
Matrix Multiplication Optimizations
- Baseline implementation
- Progressive optimization steps
- Performance analysis on Tesla T4 and V100
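To make the gap between a baseline and an optimized kernel concrete, here is a minimal sketch of two common stages: a naive kernel and a shared-memory tiled kernel. It is an illustration only and is not taken from this repository. The names `matmul_naive`, `matmul_tiled`, `run_matmul`, and the tile size `TILE = 16` are assumptions, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile edge; tune per GPU

// Baseline: one thread per output element, every operand read from global memory.
__global__ void matmul_naive(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Tiled: each block stages TILE x TILE sub-matrices of A and B in shared memory,
// so each global element is loaded once per tile instead of once per thread.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Out-of-range elements are padded with zeros so any n works.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: multiplies two row-major n x n matrices on the GPU.
std::vector<float> run_matmul(const std::vector<float>& A, const std::vector<float>& B,
                              int n, bool tiled) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    if (tiled) matmul_tiled<<<grid, block>>>(dA, dB, dC, n);
    else       matmul_naive<<<grid, block>>>(dA, dB, dC, n);
    std::vector<float> C(static_cast<size_t>(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Calling `run_matmul(A, B, n, false)` and `run_matmul(A, B, n, true)` on the same inputs should give matching results; timing the two launches is one simple way to see what the tiling step buys on a given GPU.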
Parallel Reduction Implementations
- 6 optimization versions
- Interactive Jupyter notebooks
- Performance comparisons
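As a rough picture of what a single reduction version can look like, the sketch below sums an array using shared memory, sequential addressing, and one addition during the load. It is a generic illustration and not necessarily one of the six versions in this repository. The names `reduce_sum`, `reduce_on_gpu`, and the block size `BLOCK = 256` are assumptions; the per-block partial sums are finished on the host to keep the example short, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // assumed threads per block; must be a power of two

// Each block reduces up to 2 * BLOCK input elements to one partial sum.
__global__ void reduce_sum(const float* in, float* out, int n) {
    __shared__ float sdata[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * (BLOCK * 2) + tid;

    // First addition happens during the load, halving the number of idle threads.
    float v = 0.0f;
    if (i < n) v += in[i];
    if (i + BLOCK < n) v += in[i + BLOCK];
    sdata[tid] = v;
    __syncthreads();

    // Sequential addressing: active threads stay contiguous, which avoids
    // divergent warps and shared-memory bank conflicts.
    for (int s = BLOCK / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: returns the sum of all elements of h.
float reduce_on_gpu(const std::vector<float>& h) {
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;
    int blocks = (n + 2 * BLOCK - 1) / (2 * BLOCK);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    reduce_sum<<<blocks, BLOCK>>>(d_in, d_out, n);
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    float total = 0.0f;
    for (float p : partial) total += p;  // finish the last step on the CPU
    return total;
}
```

Further steps in this style of optimization typically unroll the last warp or let each thread accumulate several elements before the shared-memory phase; those are left out here to keep the sketch small.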
Detailed Performance Analysis
- Block configuration heatmaps
- Energy efficiency metrics
- Bandwidth utilization charts
Requirements
- NVIDIA GPU with CUDA support
- CUDA Toolkit (latest version recommended)
- Python 3.x with packages:
  - numpy
  - matplotlib
  - jupyter
Performance Analysis
- Detailed analysis for Tesla T4 and V100
- Block configuration impact studies
- Energy efficiency comparisons
- Bandwidth utilization metrics
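Bandwidth utilization is usually reported as effective bandwidth (bytes moved divided by elapsed time) over the GPU's theoretical peak. The sketch below shows one way to compute it, timing a device-to-device copy with CUDA events as a stand-in for a real kernel. The function names are hypothetical, the peak value must be supplied by the caller from the vendor's spec sheet, and error checking is omitted; this is not the repository's own measurement code.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Effective bandwidth in GB/s: total bytes moved divided by elapsed time.
double effective_bandwidth_gbs(size_t bytes_read, size_t bytes_written, float elapsed_ms) {
    double seconds = static_cast<double>(elapsed_ms) * 1e-3;
    return static_cast<double>(bytes_read + bytes_written) / seconds / 1e9;
}

// Fraction of the theoretical peak actually achieved (0.0 to 1.0).
double bandwidth_utilization(double effective_gbs, double peak_gbs) {
    return effective_gbs / peak_gbs;
}

// Times a device-to-device copy of `bytes` bytes with CUDA events.
// The copy reads `bytes` and writes `bytes`, so it moves 2 * bytes in total.
float time_d2d_copy_ms(size_t bytes) {
    void *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the copy and the stop event complete
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return ms;
}
```

For example, a kernel that reads 1 GB and writes 1 GB in 10 ms has an effective bandwidth of 200 GB/s; against a 320 GB/s peak (the figure NVIDIA publishes for the Tesla T4) that is a utilization of about 0.625.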
Repository Structure
- /MatMul-Optimizations: Core matrix multiplication implementations
- /Parallel-Reduction: Reduction algorithm variations
- /Reduction-Profiling: Detailed performance analysis
- /Optimization-Results: Benchmark data and results
This project is licensed under the MIT License - see the LICENSE file for details.