Looking for info on NCHWc convolution algorithm - why is performance so much better? #11747
Unanswered · philtomson asked this question in Other Q&A
(Sorry, I asked this over in Issues, but it's probably more appropriate here.)
I'm curious about the NCHWc layout and convolution. If I use maximum optimization (-o 99) vs. -o 2, I get a 2x speedup (for a ResNet18 network). Looking at the output ONNX model, I see that with maximum optimization many ops get converted to NCHWc convolutions: Adds and other element-wise math ops become pointwise NCHWc convolutions, and BatchNorms get fused into conv ops.
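To make concrete what I mean by the NCHWc layout, here's a minimal numpy sketch of the repacking; the block size c=16 is my own assumption (one AVX-512 register holds 16 fp32 lanes), and none of this is ONNX Runtime code:

```python
import numpy as np

# Minimal sketch of NCHW -> NCHWc repacking, assuming a channel block
# size of c=16 (one AVX-512 register holds 16 fp32 lanes). My own
# illustration, not ONNX Runtime internals.

N, C, H, W, c = 1, 64, 56, 56, 16

x = np.random.rand(N, C, H, W).astype(np.float32)

# Split C into C//c blocks of c channels, move the c-sized block to the
# innermost (fastest-varying) axis, and materialize the copy, so the c
# channels of each pixel end up contiguous in memory.
x_blocked = np.ascontiguousarray(
    x.reshape(N, C // c, c, H, W).transpose(0, 1, 3, 4, 2)
)

print(x_blocked.shape)  # (1, 4, 56, 56, 16)
```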
Is NCHWc being used because it's more cache efficient (i.e., fewer cache misses), or are there other reasons why NCHWc convolutions perform so much better? For example, the hand-coded macro assembler in SconvKernelAvx512F.S: is the algorithm there somehow fundamentally different, and is NCHWc being used to avoid replicating that hand-optimized kernel for other memory layouts?
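Here's a rough illustration of the cache argument I have in mind (again just my own sketch, not anything from the runtime):

```python
import numpy as np

# For one output pixel, a conv accumulates over input channels; the
# element stride between consecutive channels shows how scattered those
# reads are in each layout. c=16 is my assumed block size.

N, C, H, W, c = 1, 64, 56, 56, 16

x_nchw = np.zeros((N, C, H, W), dtype=np.float32)
x_nchwc = np.zeros((N, C // c, H, W, c), dtype=np.float32)

# Stride (in elements) between channel k and channel k+1 of one pixel:
print(x_nchw.strides[1] // 4)   # 3136 == H*W -> a new cache line per channel
print(x_nchwc.strides[4] // 4)  # 1 -> 16 channels fit in one 64-byte line

# So in NCHWc the inner c channels map to a single contiguous vector
# load, which seems to be what a hand-tiled AVX-512 kernel wants.
```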