Hardware acceleration #47
Comments
Nvm, 12th gen briefly had this, added by accident, but it was then removed from later revisions. This probably means it is going to be in other upcoming Core series CPUs. It was already a thing for Sapphire Rapids Xeons. I also opened an upstream issue. |
In my use case we have float16 tensor outputs from an NPU on the RK3588 (an ARM processor). ARM does have NEON SIMD instructions that hardware-accelerate the conversion from fp16 to fp32, but we can't make use of those extensions from Go, as the compiler does not support SIMD instructions. Via CGO you can interface with the ARM Compute library to make use of these instructions; however, for our use case, which involves converting 856,800 bytes from uint16->fp32 per video frame, this is much slower than sticking with pure Go in this library. Better performance is still attainable by using a precalculated lookup table for the uint16->fp32 conversion. On the RK3588 we get a 35% performance improvement.
And on a Threadripper workstation we get a 69% improvement.
To create such a lookup table, we simply precalculate it in our application, then convert our output buffer from uint16 to fp32 by indexing into it.
|
@swdee Thanks for the suggestion! I will try this and see how it goes. |
@x448 We have a CGO version, worked out with @TailsFanLOL and discussed here. |
Golang declined the request to add this type to the language itself. @swdee, perhaps we should improve what you have slapped together and merge it here, adding stuff from this project's wishlist while we are at it. My current concern in your fork is that |
I actually did an Assembly version using NEON instructions for ARM. Unfortunately you can't use Go's inline assembler as its instruction set does not support SIMD instructions on any platform, so I used it inline in C via CGO.
On the RK3588 the benchmark out of this is;
The speed is slightly better than the C version we talked about on the golang issue. In my own project I stuck with the C version purely because it's easier to deal with than the ASM version. |
Here is one for x86 using AVX2.
With results on my workstation:
|
That's great. Unfortunately NEON isn't IEEE compatible, as it handles subnormals as equal to zero (with some other minor differences). |
Hey! Can this use hardware instructions for conversion? Intel CPUs have supported hardware conversion since 2013, and the new 12th gen also has support for arithmetic (I think?). Other architectures have had that for a while.
This might be possible without compiler support using embedded C code, but wouldn't that be out of scope for this?