-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for b4_SSE2 batched mode. #1825
Conversation
@johnfea thanks, I think the question (possibly unanswered) was: With many of the noise and other functions doing internal SIMD using x,y,z or r,g,b,a to take advantage of SSE would running a Batched<4> be an improvement? The Batched<4> would execute the entire shader logic and other features in SIMD, so could be a faster. If you have any feedback on that we would love to hear it. Also please attempt to add a testing configuration that exercises batched<4> to |
Based on a quick benchmark, batched SSE2 was 20% to 40% faster in total render speed with low complexity material when a single material filled the screen. 20% was with a single 3D fBm and 40% was with mix of fBm and two different procedural shaders. It seems definitely an improvement. There was at least two maybe related issues, however. A string from a shader didn't end up with correct ustringhash in output closure and its string representation was empty string. Also, backfacing() in a shader didn't work correctly anymore. |
@johnfea great to hear it is speeding things up. Curious how Batched<8> does with AVX on the same workloads vs Batched<4> with SSE. As far as backfacing() not working, it should just pull a varying value out of the BatchShaderGlobals<4>::varing.backfacing. So make sure you are populating it with value values before executing the shadernetwork. As far as the closures, once you have added a b4_SSE2 workflow to the .github/workflows/ci.yml you can craft a failing unit test that reproduces the issue, you can update this pull request and we could take a look at the failing CI. |
b4_SSE2 ran at 60% the speed of b8_AVX2 mode with single 3D fBm when measured in total render time, but for this measurement all parts of rendering were switched and not just OSL part. I updated the pull request with failing closure test unit, but I've no idea how to make the tests run so it probably won't run without some changes. Since testshade didn't seem to read output closures and testrender doesn't support batched, I added new testminimal. This issue doesn't happen with printf from a shader like done in some testshade tests. It seems the backfacing() issue occurs with batched mode at any width but it doesn't affect output. When using backfacing() as input to a mix shader, non-batched mode doesn't output the cancelled shader in closures, but batched mode does with zero weights, which is missed optimization. Also, the issue reproduced by this unit test is also generic to batched mode like #1801 and #1800 and not just with SSE2. |
59297e6
to
450c96a
Compare
I've updated the pull request, now "closure-string" test fails in batched mode only and for some reason also for non-batched icc/icx. This is the only test that fails for b8_AVX2 but for b4_SSE2 there's also "Invalid bitcast" <4x i1> to i8. |
f79e03f
to
6b1276a
Compare
I removed closure-string and testminimal from this PR and added those into a separate PR #1831. Now for b4_SSE2 only "'Invalid bitcast' <4x i1> to i8" should fail in tests and b8_AVX2 should pass. |
c947bf3
to
6551e79
Compare
All tests for this PR pass now. I'm not sure why AVX512 needs cases for width 8 in src/liboslexec/llvm_util.cpp and why AVX needs cases for width 16. I tried removing them and EDIT: it made no difference for either b16 or b8 in tests. Anyway, I added cases in those places too for width 4. There's an alternative branch here with all of those cases removed. EDIT2: Apparently, b8 AVX512 is a path in liboslexec CMakeList.txt but b16 AVX is not. My take would be to not add any b4 paths/cases for AVX512 or AVX, but retain the existing non-typical width cases for AVX512 and AVX. |
@johnfea Reviving this, sorry for the delay, it's been a busy last several weeks. Would you mind doing a rebase on top of the current main so we get a clean merge? And other than that, is this patch ready to merge now? @AlexMWells Do you have any reservations about it? |
For the Files Changed it all looks good, just adds a 4 wide path utilizing all the same approaches. Want to see the CI run it though and make sure it passes the massive pile of regressions. Great Work! |
Signed-off-by: Tuomas Tonteri <[email protected]>
Signed-off-by: Tuomas Tonteri <[email protected]>
Yes, it can be merged now. One might want to change b4sse2 ci to be more different from b8avx2. Both old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though. |
@johnfea , can you elaborate or provide example of "old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though." |
Looking at CI action, I see which successfully executed in (SSE2 batch width 4) 60 different *.regress.batched.opt tests comparing results against scalar results, and another 161 *.batched.opt tests comparing against reference txt or images. @johnfea what isn't being covered such that "Both old and new non-typical width batched code in llvm_util.c isn't covered by testsuite though"? |
I mean what I discussed above. Here's the code structure:
Here's maybe all affected functions:
I haven't looked more into it. For me these are not important since I don't currently use e.g. 8-wide AVX512. I added the completely untested cases for 4 there to not change existing code layout. I wouldn't worry about it but just as a note. |
src/liboslexec/llvm_util.cpp
Outdated
llvm::Value* mask_as_int = mask4_as_int8(mask); | ||
|
||
// Count trailing zeros, least significant | ||
llvm::Type* types[] = { intMaskType }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to repeat the code outside the switch statement, I think you can just reduce this down to
llvm::Value* mask = mask4_as_int8(mask);
break;
and let the existing code after the switch handle the rest vs. repeating it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@johnfea , I got it, so CI doesn't execute all combinations of 4,8,16 and SSE, AVX, AVX2, AVX512 ISA's so portions of llvm_util.cpp maybe untested.
|
8ed8976
to
7a08a8f
Compare
Signed-off-by: Tuomas Tonteri <[email protected]>
I think the test failures are because of a merge that just happened in OIIO master. I merged a fix into OSL main. If you rebase on top of the current main, I think that will all get cleaned up. |
Fixed now, sorry about that. One more merge/rebase on top of the (new) main ought to clear up the CI for you. |
Looks like all the CI tests are coming in passing now. I'm inclined to merge this in its current state, and any further enhancements or tests can be a follow-up. Express objections now; in their absense, I will merge this tomorrow (Monday). |
This one needs to be accepted before I can work on the others. Due to changes there's again trivial merge conflict in ci.yml. |
Signed-off-by: Larry Gritz <[email protected]>
I'm going to merge this, and then I have a fix for the CI that I will push immediately following that. |
Next time, please submit PRs from a branch other than your own main, that can make things very difficult. |
Signed-off-by: Larry Gritz <[email protected]>
Description
Add support for b4_SSE2 batched mode, enabling batched execution for all x86-64 CPUs that don't support AVX.
Is there interest in adding this into OSL? I still support CPUs with vector width of 4 in my project that uses OSL and it would be nice to not to have to resort to scalar processing.
Tests
Quick tests were run to see that output with procedural and texture based materials looked ok and proper SSE2 batched code was being generated for wide/ functions. EDIT: Additionally, all tests pass now.
Checklist:
already run clang-format v17 before submitting, I definitely will look at
the CI test that runs clang-format and fix anything that it highlights as
being nonconforming.