
Add vzeroupper support for x86 #225

Open
wants to merge 1 commit into master
Conversation

chen-hu-97

Referring to section 15.3, "MIXING AVX CODE WITH SSE CODE", of the Intel Software Optimization manual: "if software inter-mixes AVX and SSE instructions without using VZEROUPPER properly, it can experience an AVX/SSE transition penalty."

This patch adds support for vzeroupper so the JIT can emit this instruction and place it in the correct position.
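For reference, vzeroupper has a fixed encoding, so emitting it is trivial. A minimal sketch (hypothetical helper, not the actual sljit emitter API):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the sljit API: vzeroupper has a fixed
 * two-byte-VEX encoding (C5 F8 77), so emitting it amounts to copying
 * three bytes into the code buffer before the first legacy-SSE
 * instruction that follows AVX code. */
static size_t emit_vzeroupper(uint8_t *buf)
{
    buf[0] = 0xC5; /* two-byte VEX prefix */
    buf[1] = 0xF8; /* vvvv = 1111b, L = 0 (128-bit), pp = 00b */
    buf[2] = 0x77; /* opcode (shared with vzeroall, which uses L = 1) */
    return 3;      /* number of bytes written */
}
```

The hard part, as the discussion below shows, is not the encoding but deciding where in the generated code the instruction should be placed.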


Signed-off-by: Chen Hu <[email protected]>
@chen-hu-97
Author

I am not sure whether such a small patch, adding support for a single x86 instruction, warrants a PR on its own.
However, I hope to ramp up on the repo and contribute to AVX2 code generation/emission in the coming months.

@zherczeg
Owner

This kind of thing does not really fit the concept of a generic JIT compiler, because it is more of a special case for a specific issue. It would be interesting to explore other options as well, such as zeroing the target registers directly with an xor operation, since that could be generated without any extra API call. It is also a big question for me whether AVX will ever be faster than SSE2; if not, this direction is not really worth the effort.
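The xor alternative mentioned above would mean zeroing the whole 256-bit register instead of emitting vzeroupper: vpxor ymmN, ymmN, ymmN clears the upper half too, so later legacy-SSE code sees no dirty state. A hedged encoding sketch (hypothetical helper, not the sljit API):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the sljit API: vpxor ymmN, ymmN, ymmN
 * (VEX.256.66.0F EF /r) zeroes the full 256-bit register, leaving no
 * dirty upper half for subsequent legacy-SSE instructions.
 * Handles only ymm0..ymm7 (no VEX.R extension for ymm8..ymm15). */
static size_t emit_vpxor_self_ymm(uint8_t *buf, unsigned reg)
{
    buf[0] = 0xC5;                                         /* two-byte VEX prefix */
    buf[1] = (uint8_t)(0x80 | ((~reg & 0xF) << 3) | 0x05); /* R=1, vvvv=~reg, L=1, pp=66 */
    buf[2] = 0xEF;                                         /* pxor opcode */
    buf[3] = (uint8_t)(0xC0 | (reg << 3) | reg);           /* ModRM: reg, reg */
    return 4;                                              /* bytes written */
}
```

For ymm0 this produces C5 FD EF C0. The trade-off is that it only cleans registers the JIT knows about, whereas vzeroupper clears the upper bits of every ymm register at once.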

@carenas
Contributor

carenas commented Jan 5, 2024

Something that might be interesting would be to combine this logic with other CPU-specific codepaths that could be dynamically patched in.

I am sure (for example) that on really modern CPUs with a highly performant AVX-512 unit, the AVX2 code should be able to perform better than SSE2.

Indeed, once CPUs with the next variable-vector implementation (AVX10), which dynamically changes the vector size, are out, using AVX would definitely be faster, as it would also allow AVX-512 to work without any changes to the code.
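Dynamically selecting a codepath usually starts with a runtime feature check. A hedged sketch (hypothetical function name; `__builtin_cpu_supports` is a real GCC/Clang builtin, other compilers would need to issue cpuid directly):

```c
/* Hedged sketch of runtime dispatch: take the AVX2 codepath only when
 * the CPU actually reports AVX2 support, otherwise fall back to the
 * baseline SSE2 path. */
static int use_avx2_path(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0; /* conservative fallback: SSE2 path */
#endif
}
```

A JIT could run such a check once at startup and then generate (or patch in) the matching code variant, which is what "dynamically patched in" suggests above.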

@zherczeg
Owner

zherczeg commented Jan 6, 2024

For the next release I decided to use the SSE2 code path, but it would be good to improve the vector usage. I tried some methods, such as zeroing the 256-bit register before using it, but it was still slow, and moving the vzeroupper elsewhere also had no effect. Overall, I don't really understand what exactly is happening here, which makes the code hard to maintain and redesign. It feels "fragile": if you play around with a new idea, you might just break it without even understanding why. In the long run we need to understand how the CPU thinks and how we can exploit it.

3 participants