
Add vzeroupper support for x86 #225

Open
wants to merge 1 commit into master
Conversation

chen-hu-97

Referring to section 15.3, "MIXING AVX CODE WITH SSE CODE", of the Intel Software Optimization manual: "if software inter-mixes AVX and SSE instructions without using VZEROUPPER properly, it can experience an AVX/SSE transition penalty."

This patch adds support for vzeroupper so the JIT can emit this instruction and place it in the correct position.
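For reference, vzeroupper has a fixed encoding, so emitting it is trivial. A minimal sketch (hypothetical helper, not the actual sljit emitter API):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the sljit API: vzeroupper has a fixed
 * two-byte-VEX encoding (C5 F8 77), so emitting it amounts to copying
 * three bytes into the code buffer before the first legacy-SSE
 * instruction that follows AVX code. */
static size_t emit_vzeroupper(uint8_t *buf)
{
    buf[0] = 0xC5; /* two-byte VEX prefix */
    buf[1] = 0xF8; /* vvvv = 1111b, L = 0 (128-bit), pp = 00b */
    buf[2] = 0x77; /* opcode (shared with vzeroall, which uses L = 1) */
    return 3;      /* number of bytes written */
}
```

The hard part, as the discussion below shows, is not the encoding but deciding where in the generated code the instruction should be placed.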


Signed-off-by: Chen Hu <[email protected]>
@chen-hu-97
Author

I am not sure whether such a small patch, adding support for a single x86 instruction, warrants a PR on its own.
However, I hope to ramp up on the repo and contribute to AVX2 code generation/emission in the coming months.

@zherczeg
Owner

This kind of thing does not really fit the concept of a generic JIT compiler, because it is more of a special case for a specific issue. It would be interesting to explore other options as well, such as zeroing the target registers directly with an xor operation, since that could be generated without any extra API call. It is also a big question for me whether AVX will ever be faster than SSE2; if not, this direction is not really worth the effort.
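The xor alternative mentioned above would mean zeroing the whole 256-bit register instead of emitting vzeroupper: vpxor ymmN, ymmN, ymmN clears the upper half too, so later legacy-SSE code sees no dirty state. A hedged encoding sketch (hypothetical helper, not the sljit API):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the sljit API: vpxor ymmN, ymmN, ymmN
 * (VEX.256.66.0F EF /r) zeroes the full 256-bit register, leaving no
 * dirty upper half for subsequent legacy-SSE instructions.
 * Handles only ymm0..ymm7 (no VEX.R extension for ymm8..ymm15). */
static size_t emit_vpxor_self_ymm(uint8_t *buf, unsigned reg)
{
    buf[0] = 0xC5;                                         /* two-byte VEX prefix */
    buf[1] = (uint8_t)(0x80 | ((~reg & 0xF) << 3) | 0x05); /* R=1, vvvv=~reg, L=1, pp=66 */
    buf[2] = 0xEF;                                         /* pxor opcode */
    buf[3] = (uint8_t)(0xC0 | (reg << 3) | reg);           /* ModRM: reg, reg */
    return 4;                                              /* bytes written */
}
```

For ymm0 this produces C5 FD EF C0. The trade-off is that it only cleans registers the JIT knows about, whereas vzeroupper clears the upper bits of every ymm register at once.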

@carenas
Contributor

carenas commented Jan 5, 2024

Something that might be interesting would be to combine this logic with other CPU-specific codepaths that could be dynamically patched in.

I am sure (for example) that on really modern CPUs with a highly performant AVX-512 unit, the AVX2 code should be able to perform better than SSE2.

Indeed, once CPUs with the next variable-vector implementation (AVX10), which dynamically changes the vector size, are out, using AVX would definitely be faster, as it would also allow AVX-512 to work without any changes to the code.
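Dynamically selecting a codepath usually starts with a runtime feature check. A hedged sketch (hypothetical function name; `__builtin_cpu_supports` is a real GCC/Clang builtin, other compilers would need to issue cpuid directly):

```c
/* Hedged sketch of runtime dispatch: take the AVX2 codepath only when
 * the CPU actually reports AVX2 support, otherwise fall back to the
 * baseline SSE2 path. */
static int use_avx2_path(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0; /* conservative fallback: SSE2 path */
#endif
}
```

A JIT could run such a check once at startup and then generate (or patch in) the matching code variant, which is what "dynamically patched in" suggests above.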

@zherczeg
Owner

zherczeg commented Jan 6, 2024

For the next release I decided to use the SSE2 code path, but it would be good to improve the vector usage. I tried some methods, such as zeroing the 256-bit register before using it, but it was still slow, and moving the vzeroupper elsewhere also had no effect. Overall, I don't really understand what exactly is happening here, which makes the code hard to maintain and redesign. It feels "fragile": if you play around with a new idea, you might just break it without even understanding why. In the long run we need to understand how the CPU thinks and how we can exploit it.

3 participants