From dc27d354c8194fa3c74a75cb905f67a7bc8877f2 Mon Sep 17 00:00:00 2001
From: Scott Davidson
Date: Mon, 2 Oct 2023 16:24:37 +0100
Subject: [PATCH] Add Pytorch section to README

---
 README.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/README.md b/README.md
index 77dcadf..aa30b92 100644
--- a/README.md
+++ b/README.md
@@ -13,6 +13,7 @@
 - [RDMA Bandwidth](#rdma-bandwidth)
 - [RDMA Latency](#rdma-latency)
 - [fio](#fio)
+- [PyTorch](#pytorch)
 - [Operator development](#operator-development)
 
 ## Installation
@@ -289,6 +290,41 @@ spec:
       storage: 5Gi
 ```
 
+### PyTorch
+
+Runs machine learning model training and inference micro-benchmarks from the official
+PyTorch [benchmarks repo](https://github.com/pytorch/benchmark/) to compare the
+performance of CPU and GPU devices on synthetic input data. Running benchmarks on
+CUDA-capable devices requires the [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator)
+to be pre-installed on the target Kubernetes cluster.
+
+The pre-built container image currently includes the `alexnet`, `resnet50` and
+`llama` (inference only) models; additional models from the
+[upstream repo list](https://github.com/pytorch/benchmark/tree/main/torchbenchmark/models)
+may be added as needed in the future. (Adding a new model requires adding it to the
+list in `images/pytorch-benchmark/Dockerfile` and updating the `PytorchModel` enum in `pytorch.py`.)
+
+```yaml
+apiVersion: perftest.stackhpc.com/v1alpha1
+kind: Pytorch
+metadata:
+  name: pytorch-test-gpu
+spec:
+  # The device to run the benchmark on ('cpu' or 'cuda')
+  device: cuda
+  # Name of the model to benchmark
+  model: alexnet
+  # Either 'train' or 'eval'
+  # (not all models support both)
+  benchmarkType: eval
+  # Batch size for the generated input data
+  inputBatchSize: 32
+  # Defaults to 0 for device == cpu
+  # or 1 for device == cuda
+  gpuCount: 2
+```
+
+
 ## Operator development
 
 ```
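For comparison, a CPU-only training run against the same schema might look like the following. This is a sketch based on the fields shown in the patch above; the resource name is hypothetical, and `gpuCount` is omitted since the patch states it defaults to 0 when `device` is `cpu`:

```yaml
# Hypothetical CPU counterpart to the GPU example in the patch above
apiVersion: perftest.stackhpc.com/v1alpha1
kind: Pytorch
metadata:
  name: pytorch-test-cpu  # hypothetical name
spec:
  # Run on the CPU rather than a CUDA device
  device: cpu
  model: alexnet
  # 'train' exercises the backward pass as well as the forward pass
  # (note that 'llama' is inference-only, so it would need 'eval')
  benchmarkType: train
  inputBatchSize: 32
  # gpuCount omitted: defaults to 0 for device == cpu
```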