I'm consistently seeing the scalar version run faster than the zmath version on an M1 Mac, with `-Doptimize=ReleaseFast`.
Example:
cross3, dot3, scale, bias benchmark (AOS) - scalar version: 0.9780s, zmath version: 1.0045s
I noticed that the 'swizzle' function call actually generates extra CPU instructions: see the dot4Old function in this godbolt, and try switching between the commented-out line and the one next to it.
Changing `cross3` to use shuffle seems to help the benchmark. I recommend changing this everywhere. Also, dot2 looks weird... there are a lot of potential perf improvements in the zmath area.
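For illustration, here is a minimal sketch of what a shuffle-based `cross3` could look like. It assumes a 4-lane `f32` vector type (named `Vec4` here, a hypothetical alias) and uses Zig's `@shuffle` builtin directly with a comptime mask. It is not zmath's actual code, and whether it is faster on a given target should be checked in the godbolt output and the benchmark.

```zig
const std = @import("std");

// Hypothetical alias for this sketch; zmath defines its own vector types.
const Vec4 = @Vector(4, f32);

// Comptime mask selecting lanes (y, z, x, w) from the first operand.
const yzxw = @Vector(4, i32){ 1, 2, 0, 3 };

// Cross product of the xyz lanes using @shuffle instead of a swizzle helper.
// The w lane of the result is a.w * b.w - a.w * b.w, i.e. 0 for finite inputs.
fn cross3(a: Vec4, b: Vec4) Vec4 {
    const a_yzx = @shuffle(f32, a, undefined, yzxw);
    const b_yzx = @shuffle(f32, b, undefined, yzxw);
    // c holds (cross.z, cross.x, cross.y, 0); one more shuffle restores xyz order.
    const c = a * b_yzx - a_yzx * b;
    return @shuffle(f32, c, undefined, yzxw);
}
```

With this sketch, `cross3` of the x and y unit vectors gives the z unit vector, `(0, 0, 1, 0)`. Because the mask is a comptime constant, the compiler can pick a single permute instruction per shuffle where the target has one.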