Comparison of learning and inference speed of different GPU with various CNN models in pytorch List of tested AMD and NVIDIA GPUs:
Following benchmark results has been generated with the command: ./show_benchmarks_resuls.sh Graph shows the 7700S results both with the pytorch 2.3.1 and with pytorch 2.4.0. ROCM SDK builders pytorch 2.4.0 contains the optimized flashattention support for AMD RX 7700S. (and other gfx1100/gfx1101/gfx1102 and gfx1103 cards)
This command will use pytorch to search all GPUs and will then run the benchmark for each of them separately and then in the end the benchmark that uses all of the GPUs
./run_benchmarks.sh
This command shows how to execute the benchmark for single gpu by using the -i parameter.
python3 benchmark_models.py -i 1 -g 1
First GPU has index 0, second 1, etc...
- New Results are stored under "new_results" folder
- Existing old results are under results folder
- After running the benchmarks, you can create a pull request to github to request to get them merged
- You can view the results of new benchmarks by addings the name of it's result file to plot_benchmarks.py and then running the show_benchmarks.sh script.
- AMD_Radeon_RX_6800
- AMD_Radeon_RX_7900_XTX
- AMD_Radeon_RX_7700S (Framework 16 laptop discrete GPU)
- AMD_Radeon_780M (Framework 16 laptop iGPU)
- 1080TI
- TITAN XP
- TITAN V
- 2080TI
- Titan RTX
- RTX 2060
- RTX 3090
- A100-PCIE
- A100-SXM4
Graphics Card Name | GTX 1080 Ti | TITAN XP | TITAN V | RTX 2060 | RTX 2080 Ti | TITAN RTX | A100-PCIE | RTX 3090 |
---|---|---|---|---|---|---|---|---|
Process | 16nm | 16nm | 12nm | 12nm | 12nm | 12nm | 7nm | 8 nm |
Die Size | 471mm² | 471mm² | 815mm² | 445 mm² | 754mm² | 754mm² | 826mm² | 628 mm² |
Transistors | 11,800 million | 11,800 million | 21,100 million | 10,800 million | 18,600 million | 18,600 million | 54,200 million | 28,300 million |
CUDA Cores | 3584 Cores | 3840 Cores | 5120 Cores | 1920 Cores | 4352 Cores | 4608 Cores | 6912 Cores | 10496 Cores |
Tensor Cores | None | None | 640 Cores | 240 | 544 Cores | 576 Cores | 432 Cores | 328 Cores |
Clock(base) | 1481 MHz | 1405 Mhz | 1200 MHz | 1365 MHz | 1350 MHz | 1350 MHz | 765 MHz | 1395 MHz |
FP16 (half) | 177.2 GFLOPS | 189.8GFLOPS | 29,798 GFLOPS | 12.90 TFLOPS | 26,895 GFLOPS | 32.62 TFLOPS | 77.97 TFLOPS | 35.58 TFLOPS |
FP32 (float) | 11,340 GFLOPS | 12.15FLOPS | 14,899 GFLOPS | 6.451 TFLOPS | 13,448 GFLOPS | 16.31 TFLOPS | 19.49 TFLOPS | 35.58 TFLOPS |
FP64 (double) | 354.4 GFLOPS | 379.7 GFLOPS | 7,450 GFLOPS | 201.6 GFLOPS | 420.2 GFLOPS | 509.8 GFLOPS | 9.746 TFLOPS | 556.0 GFLOPS |
Memory | 11GB GDDR5X | GDDR5X | 12 GB HBM2 | 6GB GDDR6 | 11 GB GDDR6 | 24 GB GDDR6 | 40GB HBM2e | 24GB GDDR6X |
Memory Interface | 352-bit | 384bit | 3072-bit | 192 bit | 352-bit | 384 bit | 5120 bit | 384 bit |
Memory Bandwidth | 484 GB/s | 547.6GB/s | 653GB/s | 336.0 GB/s | 616 GB/s | 672.0 GB/s | 1,555 GB/s | 936.2 GB/s |
Price | $699 US | $1,199 US | $2,999 US | $ 349 US | $1,199 US | $2,499 US | $ 1,499 USD | |
Release Date | Mar 10th, 2017 | Apr 6th 2017 | Dec 7th, 2017 | Jan 7th, 2019 | Sep 20th, 2018 | Dec 18th, 2018 | Jun 22nd, 2020 | Sep 1st, 2020 |
-
Single & multi GPU with batch size 12: compare training and inference speed of **SequeezeNet, VGG-16, VGG-19, ResNet18, ResNet34, ResNet50, ResNet101, ResNet152, DenseNet121, DenseNet169, DenseNet201, DenseNet161 mobilenet mnasnet ... **
-
Experiments are performed on three types of the datatype. single-precision, double-precision, half-precision
-
making plot(plotly)
././run_benchmarks.sh
- python>=3.6(for f-formatting)
- torchvision
- torch>=1.0.0
- pandas
- psutil
- plotly(for plot)
- cufflinks(for plot)
- Pytorch version
2.3
- Number of GPUs on current device
4
- CUDA version =
10.0
- CUDNN version=
7601
nvcr.io/nvidia/pytorch:20.10-py3
(docker container in A100 and 3090)
- 2024/07/22
- benchmarks can now be run also on AMD gpus
- ./run_benchmarks.sh script uses now pytorch to query the gpu count and will first run the tests for each device separately and then by using all GPU's simultaneously
- new benchmark results are saved to new_results/<gpu_index>/<gpu_name> folder
- added new "-i" option which can be used to specify which GPU to use
- If gpu index is not specified with -i option but the total gpu count specified by -g option > 1, then the tests will be run in a way that it uses all gpus simultaneously
- 2021/02/27
- Addition result in RTX3090
- Addition result in RTX2060(thanks for gutama)
- 2021/01/07
- Addition result in TITANXP
- 2021/01/05
- Addition result in A100 A100-PCIE(PR#14)
- 2021/01/04
- Addition result in A100 SXM4
- Addition result in TitanRTX
- Edit coding style benchmark_model
- f-formatting
- save option for json
- Edit test.sh for bash shell
- Edit README.md
- 2020/09/01
- Addition result in windows10
- Edit README.md
- 2020/01/17
- Edit coding style and some bug
- Change plot method
- Add results of various model experiments(only 2080ti)
- 2019/01/09
- PR Update typo (thanks for johmathe)
- Add requirements.txt
- Add result figures
- Add ('TkAgg') for cli
- Addition Muilt GPUS (DGX-station)
- 2021/02/27
- 2021/02/27 thanks for gutama issue#16
- 2021/01/05 thanks for kirk86 pr#14
- 2021/01/05 Thanks for kirk86 pr#14
- 2021/01/04
- 2021/01/04
- thanks for olixu
- based on 2020/01/17 update
Each network is fed with 12 images with 224x224x3 dimensions. For training, time durations of 20 passes of forwarding and backward are averaged. For inference, time durations of 20 passes of forwarding are averaged. 5 warm-up steps are performed that do not calculate towards the final result.
I conducted the experiment using two RTX 2080ti.
Mode | gpu | precision | densenet121 | densenet161 | densenet169 | densenet201 | resnet101 | resnet152 | resnet18 | resnet34 | resnet50 | squeezenet1_0 | squeezenet1_1 | vgg16 | vgg16_bn | vgg19 | vgg19_bn |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Training | TITAN V | single | 56.17 ms | 120.7 ms | 72.59 ms | 93.35 ms | 84.59 ms | 119.5 ms | 16.69 ms | 28.27 ms | 50.54 ms | 15.30 ms | 9.857 ms | 72.85 ms | 80.95 ms | 85.55 ms | 94.42 ms |
Inference | TITAN V | single | 17.49 ms | 39.33 ms | 23.63 ms | 30.93 ms | 23.96 ms | 34.22 ms | 4.827 ms | 8.428 ms | 14.27 ms | 4.565 ms | 2.765 ms | 22.94 ms | 25.41 ms | 27.55 ms | 30.28 ms |
Training | TITAN V | double | 139.8 ms | 387.4 ms | 175.9 ms | 224.5 ms | 509.9 ms | 720.0 ms | 94.21 ms | 194.6 ms | 271.7 ms | 68.38 ms | 31.18 ms | 1463. ms | 1484. ms | 1993. ms | 2016. ms |
Inference | TITAN V | double | 47.68 ms | 170.5 ms | 60.73 ms | 78.43 ms | 317.7 ms | 448.6 ms | 60.26 ms | 129.9 ms | 159.8 ms | 42.37 ms | 11.95 ms | 1261. ms | 1266. ms | 1745. ms | 1751. ms |
Training | TITAN V | half | 43.79 ms | 75.16 ms | 57.53 ms | 70.88 ms | 47.82 ms | 67.43 ms | 10.48 ms | 17.19 ms | 29.08 ms | 13.15 ms | 9.390 ms | 36.03 ms | 46.84 ms | 41.16 ms | 52.65 ms |
Inference | TITAN V | half | 11.87 ms | 22.88 ms | 16.04 ms | 20.70 ms | 12.80 ms | 18.11 ms | 3.085 ms | 5.116 ms | 7.608 ms | 3.694 ms | 2.329 ms | 10.96 ms | 13.26 ms | 12.72 ms | 15.17 ms |
Training | 1080ti | single | 77.18 ms | 164.0 ms | 99.66 ms | 127.6 ms | 112.8 ms | 158.7 ms | 22.48 ms | 36.80 ms | 68.87 ms | 20.56 ms | 13.29 ms | 101.8 ms | 114.1 ms | 119.9 ms | 133.2 ms |
Inference | 1080ti | single | 23.53 ms | 51.53 ms | 31.82 ms | 41.73 ms | 33.02 ms | 47.02 ms | 6.426 ms | 10.97 ms | 20.17 ms | 7.174 ms | 4.370 ms | 33.73 ms | 37.25 ms | 39.95 ms | 44.12 ms |
Training | 1080ti | double | 779.5 ms | 2522. ms | 940.4 ms | 1196. ms | 2410. ms | 3546. ms | 463.3 ms | 969.9 ms | 1216. ms | 259.9 ms | 131.5 ms | 4227. ms | 4271. ms | 5475. ms | 5522. ms |
Inference | 1080ti | double | 47.68 ms | 275.2 ms | 1157. ms | 328.6 ms | 414.9 ms | 1080. ms | 1589. ms | 181.1 ms | 390.8 ms | 529.6 ms | 110.9 ms | 49.96 ms | 2094. ms | 2103. ms | 2775. ms |
Training | 1080ti | half | 43.79 ms | 70.00 ms | 148.4 ms | 89.43 ms | 113.6 ms | 151.0 ms | 219.5 ms | 21.00 ms | 34.84 ms | 76.24 ms | 19.60 ms | 13.18 ms | 91.60 ms | 105.9 ms | 108.1 ms |
Inference | 1080ti | half | 18.62 ms | 42.26 ms | 25.27 ms | 33.01 ms | 27.49 ms | 38.88 ms | 5.645 ms | 9.765 ms | 16.26 ms | 5.869 ms | 3.576 ms | 30.69 ms | 33.22 ms | 36.71 ms | 39.51 ms |
Mode | gpu | precision | resnet18 | resnet34 | resnet50 | resnet101 | resnet152 | densenet121 | densenet169 | densenet201 | densenet161 | squeezenet1_0 | squeezenet1_1 | vgg16 | vgg16_bn | vgg19_bn | vgg19 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Training | RTX 2080ti(1) | single | 16.36 ms | 28.44 ms | 49.63 ms | 81.40 ms | 115.1 ms | 57.69 ms | 75.18 ms | 91.69 ms | 112.7 ms | 14.49 ms | 9.108 ms | 75.86 ms | 85.42 ms | 98.43 ms | 88.05 ms |
Inference | RTX 2080ti(1) | single | 4.894 ms | 8.624 ms | 14.65 ms | 24.57 ms | 35.15 ms | 16.70 ms | 21.94 ms | 28.89 ms | 34.64 ms | 4.704 ms | 2.765 ms | 23.70 ms | 26.25 ms | 30.82 ms | 28.03 ms |
Training | RTX 2080ti(1) | double | 367.9 ms | 755.4 ms | 939.9 ms | 1844. ms | 2702. ms | 593.5 ms | 724.3 ms | 921.3 ms | 1916. ms | 187.8 ms | 94.99 ms | 3251. ms | 3277. ms | 4265. ms | 4238. ms |
Inference | RTX 2080ti(1) | double | 165.0 ms | 328.5 ms | 436.4 ms | 831.0 ms | 1196. ms | 213.8 ms | 266.0 ms | 339.5 ms | 910.7 ms | 82.71 ms | 35.79 ms | 1702. ms | 1708. ms | 2280. ms | 2274. ms |
Training | RTX 2080ti(1) | half | 13.17 ms | 22.25 ms | 35.46 ms | 57.50 ms | 81.38 ms | 51.11 ms | 66.88 ms | 80.20 ms | 88.37 ms | 17.87 ms | 35.75 ms | 53.16 ms | 63.06 ms | 72.75 ms | 61.95 ms |
Inference | RTX 2080ti(1) | half | 3.423 ms | 5.662 ms | 9.035 ms | 14.51 ms | 20.52 ms | 13.47 ms | 17.54 ms | 22.51 ms | 27.10 ms | 4.280 ms | 2.397 ms | 16.14 ms | 18.14 ms | 19.76 ms | 17.89 ms |
Training | RTX 2080ti(2) | single | 16.92 ms | 29.51 ms | 51.46 ms | 84.90 ms | 120.0 ms | 58.13 ms | 75.96 ms | 92.47 ms | 117.6 ms | 14.95 ms | 9.255 ms | 78.95 ms | 88.71 ms | 102.3 ms | 91.67 ms |
Inference | RTX 2080ti(2) | single | 5.107 ms | 8.976 ms | 15.18 ms | 25.60 ms | 36.60 ms | 17.02 ms | 22.40 ms | 29.46 ms | 36.72 ms | 4.852 ms | 2.786 ms | 24.76 ms | 27.25 ms | 32.05 ms | 29.27 ms |
Training | RTX 2080ti(2) | double | 381.9 ms | 781.5 ms | 971.6 ms | 1900. ms | 2777. ms | 610.6 ms | 744.7 ms | 948.1 ms | 1974. ms | 191.9 ms | 97.27 ms | 3317. ms | 3350. ms | 4357. ms | 4329. ms |
Inference | RTX 2080ti(2) | double | 171.8 ms | 341.7 ms | 449.5 ms | 849.5 ms | 1231. ms | 221.1 ms | 275.2 ms | 352.5 ms | 938.9 ms | 83.66 ms | 36.48 ms | 1715. ms | 1721. ms | 2294. ms | 2289. ms |
Training | RTX 2080ti(2) | half | 13.57 ms | 22.97 ms | 36.55 ms | 59.10 ms | 83.81 ms | 51.74 ms | 68.35 ms | 81.21 ms | 89.46 ms | 15.75 ms | 35.46 ms | 55.28 ms | 65.43 ms | 75.75 ms | 64.62 ms |
Inference | RTX 2080ti(2) | half | 3.520 ms | 5.837 ms | 9.272 ms | 14.93 ms | 21.13 ms | 13.38 ms | 18.71 ms | 22.40 ms | 26.82 ms | 4.446 ms | 2.406 ms | 16.29 ms | 17.91 ms | 20.90 ms | 19.14 ms |
- Results using codes prior to 2020/01/17
If you want to contribute to the experiment in an additional environment, please contribute to the result by subfolder in fig.