Add Pytorch section to README

stackhpc · Oct 2, 2023 · dc27d35
1 parent 7dcf327
Showing 1 changed file with 36 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -13,6 +13,7 @@
   - [RDMA Bandwidth](#rdma-bandwidth)
   - [RDMA Latency](#rdma-latency)
   - [fio](#fio)
+  - [Pytorch](#pytorch)
 - [Operator development](#operator-development)
 
 ## Installation
@@ -289,6 +290,41 @@ spec:
         storage: 5Gi
 ```
 
+### Pytorch
+
+Runs machine learning model training and inference micro-benchmarks from the official
+PyTorch [benchmarks repo](https://github.com/pytorch/benchmark/) to compare the performance
+of CPU and GPU devices on synthetic input data. Running benchmarks on CUDA-capable
+devices requires the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator)
+to be pre-installed on the target Kubernetes cluster.
+
+The pre-built container image currently includes the `alexnet`, `resnet50` and
+`llama` (inference only) models. Additional models from the
+[upstream repo list](https://github.com/pytorch/benchmark/tree/main/torchbenchmark/models)
+may be added as needed in the future. (Adding a new model requires adding it to the list
+in `images/pytorch-benchmark/Dockerfile` and updating the `PytorchModel` enum in `pytorch.py`.)
+
+```yaml
+apiVersion: perftest.stackhpc.com/v1alpha1
+kind: Pytorch
+metadata:
+  name: pytorch-test-gpu
+spec:
+  # The device to run the benchmark on ('cpu' or 'cuda')
+  device: cuda
+  # Name of model to benchmark
+  model: alexnet
+  # Either 'train' or 'eval'
+  # (not all models support both)
+  benchmarkType: eval
+  # Batch size for generated input data
+  inputBatchSize: 32
+  # Number of GPUs; defaults to 0 when device is 'cpu'
+  # and to 1 when device is 'cuda'
+  gpuCount: 2
+```
+
+
 ## Operator development
 
 ```
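
The README text added in this change says that a new model needs an entry in the `PytorchModel` enum in `pytorch.py`, and the example manifest documents the `gpuCount` defaults. The following is a rough sketch of how those rules might be expressed in Python. It is not the operator's actual code: `Device`, `EVAL_ONLY` and `resolve_spec` are hypothetical names introduced for illustration, and the real `PytorchModel` enum may be defined differently.

```python
from enum import Enum
from typing import Optional


class PytorchModel(str, Enum):
    """Models named in the README; a new model would get a new member here."""
    alexnet = "alexnet"
    resnet50 = "resnet50"
    llama = "llama"


class Device(str, Enum):
    """Hypothetical enum for the 'device' field ('cpu' or 'cuda')."""
    cpu = "cpu"
    cuda = "cuda"


# Assumption: the README marks llama as inference only, so 'train' is rejected.
EVAL_ONLY = {PytorchModel.llama}


def resolve_spec(
    model: str,
    device: str = "cpu",
    benchmark_type: str = "eval",
    gpu_count: Optional[int] = None,
) -> dict:
    """Validate the fields and fill in the documented gpuCount default."""
    m = PytorchModel(model)  # raises ValueError for a model not in the enum
    d = Device(device)       # raises ValueError for anything but cpu/cuda
    if benchmark_type not in ("train", "eval"):
        raise ValueError(f"benchmarkType must be 'train' or 'eval', got {benchmark_type!r}")
    if benchmark_type == "train" and m in EVAL_ONLY:
        raise ValueError(f"model {m.value!r} supports inference only")
    if gpu_count is None:
        # Documented default: 0 for cpu, 1 for cuda.
        gpu_count = 1 if d is Device.cuda else 0
    return {
        "device": d.value,
        "model": m.value,
        "benchmarkType": benchmark_type,
        "gpuCount": gpu_count,
    }
```

Under these assumptions, `resolve_spec("alexnet", "cuda")` fills in `gpuCount` as 1, while an explicit value such as the `gpuCount: 2` in the manifest above is passed through unchanged.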