[rlsw] Smarter texture management #5296
|
@Bigfoot71 Looks great! It simplifies some parts! Thanks! |
|
@Bigfoot71 Just testing this version with the |
|
I just added texture conversion at load time, they're converted to RGBA32 for now, and there's still a ton of float <-> int conversions. And I spoke too soon, of course the framebuffer doesn't support an alpha channel, and that causes problems for packed formats like RGB16/RGB8 if we want textures to use the same format... I'll think more about all this, but even this small change already simplified things a bit more, and I went from 750 bunnies to 950 bunnies before dropping below 60 FPS with the bunnymark example on my machine. It's not a very serious benchmark, but it gives an idea. |
Yes, I just cleaned up the framebuffer code to make it easier to modify. The logic remains the same for now at this level. |
Yeah, I did. Just synced when it came in.
That seems a lot better than my results, are you using |
I'm using |
|
@raysan5 Quick question, regarding the framebuffer blit, can we remove the blit to floating-point formats? I mean, when we want to copy the result to an They're there mainly to make the implementation complete, but they don't seem useful as is, and they can simplify a lot of things. If you prefer to keep it so that it is complete, I will understand. |
|
Sorry, I was building it in Debug mode... Ok, now I'm building with VS2022 (MSVC), Windows 10 64bit, now in Release mode (O2) and getting better results: |
Sure! I doubt anyone is going to use floating-point channels with a software renderer! In any case, an error can be logged if anyone tries to request that... |
|
@raysan5 Alright, I tried a bunch of stuff, and here's the report. The problem right now is that we can't (or rather, don't want to) work directly with integers for colors, because to reduce the load on interpolations we're currently doing increments instead of a full 'lerp' all the time during rasterization. That means we need to be able to work with fractions. So we have three options:
So really, there are only two real solutions in my opinion: floats or fixed-point. I've just done a huge amount of work removing all floating-point operations from the hot path (clipping/rasterization). I still have a few issues with clipping, which I can handle, but for now, using fixed-point, I dropped from 950 bunnies to 800 before falling under 60 FPS (yes, I like bunnies). So there's a performance hit on desktop PCs. My guess is that it's because the code becomes less vectorizable by the compiler (with O2/O3) when using fixed-point, plus there's the cost of shifts on every mul/div. However, if we want to run RLSW on devices without an FPU, it could be a good thing. I think with a bit of elbow grease and some smart tricks we could get back to the same performance as before (and probably more), but there will still be a penalty compared to all-float on desktop. Otherwise, we can just assume there will always be an FPU and a compiler capable of auto-vectorizing efficiently. So yeah, it's up to you. We have both options in my opinion, it depends on what you prefer for later or now, and which use cases you want to prioritize. If you want to go with fixed-point despite the desktop performance hit, everything’s almost ready, just the clipping is still acting up. And if anyone has any ideas or anything to say, I’m all ears! |
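To make the trade-off above concrete, here is a minimal sketch of incremental interpolation along a span, once with floats and once with 16.16 fixed-point. It is not taken from rlsw: the `fx16` type, the helper names and the span functions are all hypothetical, and a real rasterizer would interpolate several attributes at once.

```c
#include <stdint.h>

// Hypothetical 16.16 fixed-point type, for illustration only (not from rlsw).
typedef int32_t fx16;
#define FX_SHIFT 16
#define FX_ONE   (1 << FX_SHIFT)
#define FX_HALF  (1 << (FX_SHIFT - 1))

static inline fx16 fx_from_int(int v) { return (fx16)(v * FX_ONE); }

// Float version: one add per pixel instead of a full lerp.
void span_channel_float(uint8_t *dst, int count, uint8_t c0, uint8_t c1)
{
    if (count <= 0) return;
    float c = (float)c0 + 0.5f; // +0.5 so truncation rounds to nearest
    float step = (count > 1) ? ((float)c1 - (float)c0)/(float)(count - 1) : 0.0f;
    for (int i = 0; i < count; i++)
    {
        dst[i] = (uint8_t)c;
        c += step;
    }
}

// Fixed-point version: same increment scheme, integer ops only.
// The value stays within [0, 256) for any realistic span length,
// so the shift below never sees a negative number.
void span_channel_fixed(uint8_t *dst, int count, uint8_t c0, uint8_t c1)
{
    if (count <= 0) return;
    fx16 c = fx_from_int(c0) + FX_HALF;
    fx16 step = (count > 1) ? (fx_from_int(c1) - fx_from_int(c0))/(count - 1) : 0;
    for (int i = 0; i < count; i++)
    {
        dst[i] = (uint8_t)(c >> FX_SHIFT);
        c += step;
    }
}
```

Both loops carry a fractional accumulator, which is why plain 8-bit integers are not enough here. Any fixed-point multiply or divide elsewhere in the pipeline would also need a shift, which is the extra cost mentioned above.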
|
I forgot to specify, if in my test I ended up replacing everything with fixed point, it's because I initially did it only for the colors, but that still introduced a lot of float/fixed conversion, so I migrated everything. |
|
@Bigfoot71 Thanks for the detailed explanation and all the testing. Although I like fixed-point (very old-school), I think the best approach for the future is to use floats and expect an FPU to be available. Compilers can definitely do a great job of optimizing code when fixed-point and float operations are not mixed. I think we can explore other possible optimization routes. Did you detect any costly operation or bottleneck? I guess alpha blending is costly, so maybe allow a path with no alpha operations? Texture fetching is usually another costly operation (at least on the GPU side). What interpolation is it using at the moment? And there's also the rasterization process for the triangles, probably some work can be done on that side... Just giving some random ideas. Obviously SIMD operations and multi-threading (tile-based rendering) would improve performance, but they would imply hardware-specific implementations and a considerable complexity cost... Still, in the current design, are vertex transform and pixel raster/blending vectorizable? @JeffM2501 What are your thoughts? |
|
Quick reply, we could test OpenMP for parallelization, not ideal for every case, but worth a try. Blending and sampling are costly too. Nearest filtering avoids interpolation, but UV wrapping can bottleneck a bit, hence the custom Specialized paths could help, though they're tricky to fit into the hot path. There's already one for axis-aligned quads (faster than two triangles), similar ideas might apply elsewhere. Edit: try not to flood the thread |
This adds SIMD framebuffer read/write, and also texel fetch. This supports SSE2/SSE41 and SISD fallback (also includes ARM NEON support, conceptually identical but still needs testing).
|
I added SIMD framebuffer read/write and texel fetch (SSE2/SSE41), and I also switched the framebuffer to RGBA32 by default. There's still packed RGB16 and RGB8 support, which doesn't benefit from SIMD. There's also ARM NEON support, which is normally conceptually identical, but it needs to be tested. It's normally fine, unless I missed something. With SSE2, I went from 950 to 1450 bunnies, and with SSE41 up to 1530 bunnies (before dropping below 60). It could be even more efficient if we could process multiple pixels at once, but it gets technical with scanlines. As for the rest, the blend mode could be rewritten quite easily in SIMD, but I think it's better to leave that to the compiler. The code is very simple here, but I could always run a test if necessary. There might be something to do for bilinear filtering, I'll think about it. I also have a SIMD matrix multiplication implementation aside, and other small things, I can add them, it's still a detail but it can help. (sorry for being long in my messages, I try not to leave anything out if anyone wants to help) |
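For readers following along, a minimal sketch of what an SSE2 framebuffer read/write for one RGBA32 pixel can look like, with a scalar fallback. This is an illustration under assumptions, not the code from this PR: the function names and the `SKETCH_HAVE_SSE2` macro are hypothetical, and only standard SSE2 intrinsics from `<emmintrin.h>` are used.

```c
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && (_M_IX86_FP >= 2))
    #include <emmintrin.h>
    #define SKETCH_HAVE_SSE2 1
#else
    #define SKETCH_HAVE_SSE2 0
#endif

// Read one RGBA8 pixel into four normalized floats (hypothetical helper).
void pixel_load_rgba8(const uint8_t *src, float out[4])
{
#if SKETCH_HAVE_SSE2
    int32_t packed;
    memcpy(&packed, src, 4);                        // safe unaligned 32-bit load
    __m128i p = _mm_cvtsi32_si128(packed);
    __m128i zero = _mm_setzero_si128();
    p = _mm_unpacklo_epi8(p, zero);                 // widen 8 -> 16 bits
    p = _mm_unpacklo_epi16(p, zero);                // widen 16 -> 32 bits
    __m128 f = _mm_mul_ps(_mm_cvtepi32_ps(p), _mm_set1_ps(1.0f/255.0f));
    _mm_storeu_ps(out, f);
#else
    for (int i = 0; i < 4; i++) out[i] = (float)src[i]*(1.0f/255.0f);
#endif
}

// Write four floats back as one RGBA8 pixel, clamping to [0, 1] first.
void pixel_store_rgba8(uint8_t *dst, const float in[4])
{
#if SKETCH_HAVE_SSE2
    __m128 f = _mm_loadu_ps(in);
    f = _mm_min_ps(_mm_max_ps(f, _mm_setzero_ps()), _mm_set1_ps(1.0f));
    f = _mm_add_ps(_mm_mul_ps(f, _mm_set1_ps(255.0f)), _mm_set1_ps(0.5f));
    __m128i i32 = _mm_cvttps_epi32(f);              // truncate after +0.5 => round
    __m128i i16 = _mm_packs_epi32(i32, i32);        // 32 -> 16 bits (signed saturate)
    __m128i i8 = _mm_packus_epi16(i16, i16);        // 16 -> 8 bits (unsigned saturate)
    int32_t packed = _mm_cvtsi128_si32(i8);
    memcpy(dst, &packed, 4);
#else
    for (int i = 0; i < 4; i++)
    {
        float v = in[i];
        if (v < 0.0f) v = 0.0f;
        if (v > 1.0f) v = 1.0f;
        dst[i] = (uint8_t)(v*255.0f + 0.5f);
    }
#endif
}
```

On SSE4.1, the two unpack steps in the load can be replaced by a single `_mm_cvtepu8_epi32`, which may account for part of the difference reported between the two instruction sets. Packed RGB16/RGB8 layouts do not map onto this four-lane pattern as directly.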
Wow! Big improvement! Tested it on my system and now I'm getting ~700 bunnies @ 60 FPS and ~1500 bunnies @ 30 FPS! Amazing! Many 2D games (and tool UIs) do not need that many sprites on screen!
No worries, I think those latest improvements are already a huge boost! Regarding "I also have a SIMD matrix multiplication implementation aside, and other small things, I can add them, it's still a detail but it can help": feel free, whenever you want. Thanks for the detailed explanations, impressive to see how everything improved! In the following days I'll do some comparison with official Mesa implementations ( I think we can merge current changes, let me know when ready. |
unrelated to the PR, but at least it's done
|
I made a few last adjustments to the code, including one that disables sampling if the texture is '0'. We could also, when loading textures, check if at least one alpha value is lower than 255, and if the texture is fully opaque and all vertex colors are fully opaque during rendering, then disable blending if it's an alpha mix. But that might be going a bit too far; in cases like that, just changing the pipeline state costs absolutely nothing anyway. Note that the default texture is kept at '0' because it is used in the case where a texture is generated but an image is not "uploaded" to it. You can merge, I think it's good to go. However, if there are better ideas regarding textures, the discussion could still be relevant. There will still be some minor issues with the rasterizer where the triangles jitter a bit. It shouldn't be too hard to fix, but it requires carefully reviewing the logic, I'll try to look into that later. |
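As an illustration of the load-time alpha check floated in the comment above (which was considered possibly excessive and is not part of this PR), a minimal sketch could look like the following. Both function names are hypothetical, and the texture is assumed to be already converted to tightly packed RGBA8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Scan an RGBA8 texture once at load time (hypothetical helper):
// returns true if at least one texel has alpha below 255.
bool texture_has_transparency(const uint8_t *rgba, size_t pixelCount)
{
    for (size_t i = 0; i < pixelCount; i++)
    {
        if (rgba[i*4 + 3] < 255) return true;
    }
    return false;
}

// Alpha blending has no visible effect when neither the texture
// nor any vertex color carries transparency, so it could be skipped.
bool can_skip_alpha_blend(bool textureHasTransparency,
                          const uint8_t *vertexAlpha, size_t vertexCount)
{
    if (textureHasTransparency) return false;
    for (size_t i = 0; i < vertexCount; i++)
    {
        if (vertexAlpha[i] < 255) return false;
    }
    return true;
}
```

The scan is a one-time cost per texture, while the vertex check would run per primitive, which is the part that has to be weighed against simply changing the pipeline state.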
|
@Bigfoot71 Thanks! Great improvements! Merged! |
Draft to work on the proposal here: #5294 (comment)
To start, I've simplified the framebuffer handling as much as possible to make it more modular and easier to modify.
The main idea now is to convert loaded textures to the same format as the framebuffer, reducing the color conversions needed for reading and writing.
Since the write destination is always the same, there's no need to keep the textures in their original format or support multiple formats at runtime.
I'll do an update once the texture conversion is done, and we'll see what else can be improved.
Any ideas or thoughts are welcome!
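To illustrate the load-time conversion described above, here is a rough sketch that converts a few source formats to RGBA8 once, so the sampler only ever deals with one layout. The `SrcFormat` enum and `texture_to_rgba8` are hypothetical names chosen for this example and do not reflect the actual rlsw API or its full format list.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical subset of source pixel formats, for illustration only.
typedef enum {
    SRC_FORMAT_GRAYSCALE,   // 1 byte per pixel
    SRC_FORMAT_R8G8B8,      // 3 bytes per pixel
    SRC_FORMAT_R8G8B8A8     // 4 bytes per pixel
} SrcFormat;

// Convert an image to tightly packed RGBA8 at load time.
// Returns a malloc'd buffer the caller must free, or NULL on failure.
uint8_t *texture_to_rgba8(const uint8_t *src, int width, int height, SrcFormat format)
{
    if ((src == NULL) || (width <= 0) || (height <= 0)) return NULL;

    size_t count = (size_t)width*(size_t)height;
    uint8_t *dst = (uint8_t *)malloc(count*4);
    if (dst == NULL) return NULL;

    for (size_t i = 0; i < count; i++)
    {
        uint8_t *px = dst + i*4;
        switch (format)
        {
            case SRC_FORMAT_GRAYSCALE:
                px[0] = px[1] = px[2] = src[i];
                px[3] = 255;            // no alpha in source: fully opaque
                break;
            case SRC_FORMAT_R8G8B8:
                px[0] = src[i*3 + 0];
                px[1] = src[i*3 + 1];
                px[2] = src[i*3 + 2];
                px[3] = 255;
                break;
            case SRC_FORMAT_R8G8B8A8:
                px[0] = src[i*4 + 0];
                px[1] = src[i*4 + 1];
                px[2] = src[i*4 + 2];
                px[3] = src[i*4 + 3];
                break;
            default:
                free(dst);
                return NULL;
        }
    }
    return dst;
}
```

The per-format branching happens once per texture at load time instead of once per texel fetch, at the price of extra memory for formats narrower than 4 bytes per pixel.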