Investigate whether inlining the `matmul` and `matmul(transpose,...)` intrinsics performs better than leaving them as intrinsic calls. @radu2k (https://github.com/radu2k) is looking into this.