Add BatchNorm kernel for ROCm #9014

mindest · 2021-09-09T11:52:16Z

Description:

Implement BatchNormInternal and BatchNormalizationGrad for the ROCm EP:
- Support float and float16 only, as the MIOpen kernel does not support double yet.
- Default to spatial mode, as per the ONNX spec.

Update the BatchNorm CUDA test so that it also runs for ROCm.
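
For readers unfamiliar with the terms above, here is a minimal reference sketch of inference-style batch normalization in spatial mode, where one mean/variance pair per channel is shared across the N, H and W dimensions. It is not the MIOpen or ONNX Runtime code. The helper names (`clamp_epsilon`, `batch_norm_spatial`) are hypothetical, and the `MIN_EPSILON` floor of 1e-5 is an assumed value chosen only to illustrate the "limit min epsilon" idea from the commits in this PR.

```python
import math

# Assumed floor for epsilon; the real limit used by the ROCm EP may differ.
MIN_EPSILON = 1e-5


def clamp_epsilon(epsilon, min_epsilon=MIN_EPSILON):
    """Return epsilon, raised to min_epsilon if it is smaller."""
    return max(epsilon, min_epsilon)


def batch_norm_spatial(x, scale, bias, mean, var, epsilon=1e-5):
    """Normalize x (nested lists shaped [N][C][H][W]) per channel.

    scale, bias, mean and var each hold one value per channel, which is
    what "spatial" mode means: statistics are shared over N, H and W.
    Computes y = scale * (x - mean) / sqrt(var + epsilon) + bias.
    """
    eps = clamp_epsilon(epsilon)
    out = []
    for sample in x:
        out_sample = []
        for c, plane in enumerate(sample):
            inv_std = 1.0 / math.sqrt(var[c] + eps)
            out_sample.append(
                [[scale[c] * (v - mean[c]) * inv_std + bias[c] for v in row]
                 for row in plane]
            )
        out.append(out_sample)
    return out


# One sample, two channels, each channel a 1x2 plane.
x = [[[[1.0, 3.0]], [[2.0, 2.0]]]]
y = batch_norm_spatial(x, scale=[1.0, 2.0], bias=[0.0, 1.0],
                       mean=[2.0, 2.0], var=[1.0, 4.0], epsilon=0.0)
```

Passing `epsilon=0.0` here shows the clamp taking effect: the computation uses the assumed floor instead of zero, so a zero variance could not cause a division by zero.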

Motivation and Context

Why is this change required?
The 1-P model TwinBERT needs this for training on MI100 clusters.

* Add BatchNorm kernel for ROCm, update BN test
* correct epsilon_ setting; limit min epsilon

* Revert "Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)"
  This reverts commit 4788839.
* Add BatchNorm kernel for ROCm (#9014)
* Add BatchNorm kernel for ROCm, update BN test
* correct epsilon_ setting; limit min epsilon
* Upgrade ROCm CI pipeline for ROCm 4.3.1 and permit run inside container (#9070)
* try to run inside 4.3.1 container
* no \ in container run command
* remove networking options
* try with adding video render groups
* add job to build docker image
* try without 1st stage
* change alpha, beta to float
* try adding service connection
* retain huggingface directory
* static video and render gid
* use runtime expression for variables
* install torch-ort
* pin sacrebleu==1.5.1
* update curves for rocm 4.3.1
* try again
* disable determinism and only check tail of loss curve and with a much larger threshold of 0.05
* disable RoBERTa due to high run variablity on ROCm 4.3.1
* put reduction unit tests back in
* Fix nightly CI pipeline to generate ROCm 4.2 wheels and add ROCm 4.3.1 wheels (#9101)
* make work for both rocm 4.2 and rocm 4.3.1
* fix rocm 4.3.1 docker image reference
* fix CUDA_VERSION to ROCM_VERSION
* fix ReduceConsts conflict def
* add ifdef to miopen_common.h as well
* trailing ws

Co-authored-by: wangye <[email protected]>
Co-authored-by: mindest <[email protected]>

Add BatchNorm kernel for ROCm, update BN test

5018215

mindest marked this pull request as ready for review September 10, 2021 03:36

mindest requested a review from weixingzhang September 10, 2021 03:36

mindest added the "training" label (issues related to ONNX Runtime training; typically submitted using template) Sep 10, 2021

correct epsilon_ setting; limit min epsilon

7d16a9e

weixingzhang approved these changes Sep 13, 2021


mindest merged commit a1021a1 into master Sep 13, 2021

mindest deleted the linmin/bn_rocm branch September 13, 2021 07:15

suffiank pushed a commit that referenced this pull request Sep 21, 2021

Add BatchNorm kernel for ROCm (#9014)

e7dd68f

* Add BatchNorm kernel for ROCm, update BN test
* correct epsilon_ setting; limit min epsilon
