@Bigfoot71
Contributor

Draft to work on the proposal here: #5294 (comment)

To start, I've simplified the framebuffer handling as much as possible to make it more modular and easier to modify.

The main idea now is to convert loaded textures to the same format as the framebuffer, reducing the color conversions needed for reading and writing.

Since the write destination is always the same, there's no need to keep the textures in their original format or support multiple formats at runtime.

I'll do an update once the texture conversion is done, and we'll see what else can be improved.

Any ideas or thoughts are welcome!

@raysan5
Owner

raysan5 commented Oct 23, 2025

@Bigfoot71 Looks great! It simplifies some parts! Thanks!

@raysan5
Owner

raysan5 commented Oct 23, 2025

@Bigfoot71 Just testing this version with the textures_bunnymark example and comparing with previous version. It seems the performance is mostly the same. On my system I got about 1100 bunnies @ 5 fps. Shouldn't there be some improvement? Maybe I'm doing something wrong... I'm using the new PLATFORM_DESKTOP_WIN32 backend, maybe hitting some bottleneck on blitting... 🤔

@Bigfoot71
Contributor Author

I just added texture conversion at load time, they're converted to RGBA32 for now, and there's still a ton of float <-> int conversions.
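As a rough illustration of converting at load time, here is a minimal sketch that expands packed RGB565 texels to 8-bit RGBA. The function name `convert_rgb565_to_rgba8888_sketch` is hypothetical and the code is not taken from RLSW; it only shows the idea of paying the conversion cost once, at load, instead of on every texel fetch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper (not RLSW code): expand RGB565 texels to RGBA8888 once,
// at load time, so the sampler never has to unpack a 16-bit texel at runtime.
static void convert_rgb565_to_rgba8888_sketch(const uint16_t *src, uint8_t *dst, size_t count)
{
    assert(src != NULL && dst != NULL);

    for (size_t i = 0; i < count; i++)
    {
        uint16_t p = src[i];
        uint8_t r5 = (uint8_t)((p >> 11) & 0x1F);
        uint8_t g6 = (uint8_t)((p >> 5) & 0x3F);
        uint8_t b5 = (uint8_t)(p & 0x1F);

        // Replicate the high bits into the low bits so full scale maps to 255
        dst[4*i + 0] = (uint8_t)((r5 << 3) | (r5 >> 2));
        dst[4*i + 1] = (uint8_t)((g6 << 2) | (g6 >> 4));
        dst[4*i + 2] = (uint8_t)((b5 << 3) | (b5 >> 2));
        dst[4*i + 3] = 255; // RGB565 has no alpha, so treat the texel as opaque
    }
}
```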

And I spoke too soon, of course the framebuffer doesn't support an alpha channel, and that causes problems for packed formats like RGB16/RGB8 if we want textures to use the same format...

I'll think more about all this, but even this small change already simplified things a bit more, and I went from 750 bunnies to 950 bunnies before dropping below 60 FPS with the bunnymark example on my machine.

It's not a very serious benchmark, but it gives an idea.

@Bigfoot71
Contributor Author

Just testing this version with the textures_bunnymark example and comparing with previous version. It seems the performance is mostly the same. On my system I got about 1100 bunnies @ 5 fps. Shouldn't there be some improvement? Maybe I'm doing something wrong... I'm using the new PLATFORM_DESKTOP_WIN32 backend, maybe hitting some bottleneck on blitting... 🤔

Yes, I just cleaned up the framebuffer code to make it easier to modify. The logic remains the same for now at this level.
I just overhauled the texture management, see my previous post. You can try it with the very latest commit I just pushed.

@raysan5
Owner

raysan5 commented Oct 23, 2025

You can try it with the very latest commit I just pushed.

Yeah, I did. Just synced when it came in.

I went from 750 bunnies to 950 bunnies before dropping below 60 FPS

That seems a lot better than my results. Are you using the PLATFORM_DESKTOP_WIN32 backend or PLATFORM_DESKTOP_SDL? In any case, I'm testing on a fairly limited laptop:

Processor: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
RAM:       16.0 GB (15.7 GB usable)

@Bigfoot71
Contributor Author

That seems a lot better than my results. Are you using the PLATFORM_DESKTOP_WIN32 backend or PLATFORM_DESKTOP_SDL?

I'm using PLATFORM_DESKTOP_SDL on Linux Mint
I'm building with CMake in release mode, so normally with O3 (GCC)
For the CPU, I have an AMD Ryzen 5 3600 3.6 GHz, but I have an old one to test it on, I'll see about that later

@Bigfoot71
Contributor Author

@raysan5 Quick question, regarding the framebuffer blit, can we remove the blit to floating-point formats?

I mean, when we want to copy the result to an UNCOMPRESSED_R32G32B32A32 image, for example.

They're there mainly to make the implementation complete, but they don't seem useful as is, and removing them would simplify a lot of things.

If you prefer to keep it so that it is complete, I will understand.
Otherwise we may set an INVALID_ENUM error or something else if we try to do it.

@raysan5
Owner

raysan5 commented Oct 23, 2025

Sorry, I was building it in Debug mode...

Ok, now I'm building with VS2022 (MSVC), Windows 10 64bit, now in Release mode (O2) and getting better results:
~2500 bunnies @ 15 FPS.

@raysan5
Owner

raysan5 commented Oct 23, 2025

can we remove the blit to floating-point formats?

Sure! I doubt anyone is going to use floating-point channels with a software renderer! In any case, some log error can be shown if anyone tries to request that...

@Bigfoot71
Contributor Author

@raysan5 Alright, I tried a bunch of stuff, and here's the report.

The problem right now is that we can't (or rather, don't want to) work directly with integers for colors, because to reduce the load on interpolations we're currently doing increments instead of a full 'lerp' all the time during rasterization. That means we need to be able to work with fractions.
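To make the "increments instead of a full lerp" point concrete, here is a minimal sketch of shading one span. The names `ColorF` and `shade_span_incremental_sketch` are hypothetical and this is not the RLSW rasterizer; it only shows why the per-pixel step has to be fractional, which plain `uint8_t` channels cannot represent.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float r, g, b, a; } ColorF; // hypothetical type for this sketch

// Shade `count` pixels from c0 to c1 with one add per channel per pixel,
// instead of evaluating c0 + t*(c1 - c0) from scratch for every pixel.
static void shade_span_incremental_sketch(ColorF c0, ColorF c1, ColorF *out, int count)
{
    assert(out != NULL && count > 0);

    // The step is computed once per span; it is usually well below 1/255,
    // so it cannot be stored in an 8-bit integer channel.
    float inv = (count > 1) ? 1.0f/(float)(count - 1) : 0.0f;
    ColorF step = {
        (c1.r - c0.r)*inv, (c1.g - c0.g)*inv,
        (c1.b - c0.b)*inv, (c1.a - c0.a)*inv
    };

    ColorF c = c0;
    for (int x = 0; x < count; x++)
    {
        out[x] = c;
        c.r += step.r; c.g += step.g; c.b += step.b; c.a += step.a;
    }
}
```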

So we have three options:

  • Keep using floats like we do now
  • Use fixed-point
  • Or lerp uint8_t with floats (expensive)

So really, there are only two real solutions in my opinion: floats or fixed-point.

I've just done a huge amount of work removing all floating-point operations from the hot path (clipping/rasterization).

I still have a few issues with clipping, which I can handle, but for now, using fixed-point, I dropped from 950 bunnies to 800 before falling under 60 FPS (yes, I like bunnies).

So there's a performance hit on desktop PCs. My guess is that it's because the code becomes less vectorizable by the compiler (with O2/O3) when using fixed-point, plus there's the cost of shifts on every mul/div.
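For readers unfamiliar with the shift cost mentioned here, this is a minimal 16.16 fixed-point sketch. The type and function names are hypothetical and the layout is an assumption (the thread does not say which fixed-point format was tested); it just shows the extra widening and shifting that every multiply and divide needs.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical 16.16 fixed-point helpers (assumed format, not RLSW code)
typedef int32_t fixed16_sketch;
#define FIXED_SHIFT_SKETCH 16
#define FIXED_ONE_SKETCH   (1 << FIXED_SHIFT_SKETCH)

static inline fixed16_sketch fixed_from_float_sketch(float f)
{
    assert(f > -32768.0f && f < 32768.0f); // must fit in 16 integer bits
    return (fixed16_sketch)(f*(float)FIXED_ONE_SKETCH);
}

static inline float fixed_to_float_sketch(fixed16_sketch x)
{
    return (float)x/(float)FIXED_ONE_SKETCH;
}

// Multiply: widen to 64 bits, then shift the extra fractional bits back out.
// Right-shifting a negative value is implementation-defined in C, though
// mainstream compilers perform an arithmetic shift.
static inline fixed16_sketch fixed_mul_sketch(fixed16_sketch a, fixed16_sketch b)
{
    return (fixed16_sketch)(((int64_t)a*(int64_t)b) >> FIXED_SHIFT_SKETCH);
}

// Divide: pre-scale the numerator so the quotient keeps 16 fractional bits
static inline fixed16_sketch fixed_div_sketch(fixed16_sketch a, fixed16_sketch b)
{
    assert(b != 0);
    return (fixed16_sketch)(((int64_t)a*FIXED_ONE_SKETCH)/b);
}
```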

However, if we want to run RLSW on devices without an FPU, it could be a good thing. I think with a bit of elbow grease and some smart tricks we could get back to the same performance as before (and probably more), but there will still be a penalty compared to all-float on desktop.

Otherwise, we can just assume there will always be an FPU and a compiler capable of auto-vectorizing efficiently.

So yeah, it's up to you. We have both options in my opinion, it depends on what you prefer for later or now, and which use cases you want to prioritize.

If you want to go with fixed-point despite the desktop performance hit, everything’s almost ready, just the clipping is still acting up.

And if anyone has any ideas or anything to say, I’m all ears!

@Bigfoot71
Contributor Author

I forgot to mention: in my test I ended up replacing everything with fixed-point because I initially did it only for the colors, but that still introduced a lot of float/fixed conversions, so I migrated everything.

@raysan5
Owner

raysan5 commented Oct 24, 2025

@Bigfoot71 Thanks for the detailed explanation and all the testing. Although I like fixed-point (very old-school), I think the best approach for the future is to use floats and expect an FPU to be available. Compilers can definitely do a great job of optimizing code when fixed-point and float operations are not mixed.

I think we can explore other possible optimization routes. Did you detect any costly operation or bottleneck? I guess alpha blending should be costly, so maybe allow a path with no alpha operations? Texture fetching is usually another costly operation (at least on the GPU side); what interpolation is it using at the moment? And there's also the rasterization process for the triangles, probably some work can be done on that side... Just giving some random ideas.

Obviously SIMD operations and multi-threading (tile-based rendering) would improve performance, but they would imply hardware-specific implementations and a considerable complexity cost... Still, in the current design, are vertex transform and pixel raster/blending vectorizable?

@JeffM2501 What are your thoughts?

@Bigfoot71
Contributor Author

Bigfoot71 commented Oct 24, 2025

Quick reply, we could test OpenMP for parallelization, not ideal for every case, but worth a try.

Blending and sampling are costly too. Nearest filtering avoids interpolation, but UV wrapping can bottleneck a bit, hence the custom sw_saturate, slightly faster than clamp(x, 0, 1) on floats.
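The body of `sw_saturate` is not shown in this thread, so the following is only a guess at how a float saturate can avoid a generic `clamp(x, 0, 1)`. The name `saturate_sketch` is hypothetical; the sketch relies on IEEE 754 single-precision floats, where non-negative values order the same way as their bit patterns.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Hypothetical saturate (NOT the actual sw_saturate): clamps x to [0, 1]
// by working on the IEEE 754 bit pattern of the float.
static inline float saturate_sketch(float x)
{
    assert(sizeof(float) == sizeof(uint32_t));

    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    // All ones when the sign bit is set, zero otherwise
    uint32_t negative = 0u - (bits >> 31);
    bits &= ~negative; // negative inputs (and -0.0f) become +0.0f

    // For non-negative floats, integer order matches float order,
    // so anything above the bit pattern of 1.0f is clamped to 1.0f
    if (bits > 0x3F800000u) bits = 0x3F800000u;

    memcpy(&x, &bits, sizeof x);
    return x;
}
```

Whether this actually beats a plain clamp depends on the compiler and target, so it would need to be measured rather than assumed.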

Specialized paths could help, though they're tricky to fit into the hot path. There's already one for axis-aligned quads (faster than two triangles), similar ideas might apply elsewhere.

Edit: try not to flood the thread

This adds SIMD framebuffer read/write, and also texel fetch.
This supports SSE2/SSE41 and SISD fallback (also includes ARM NEON support, conceptually identical but still needs testing).
@Bigfoot71
Contributor Author

Bigfoot71 commented Oct 24, 2025

I added SIMD framebuffer read/write and texel fetch (SSE2/SSE41), and I also switched the framebuffer to RGBA32 by default. There's still packed RGB16 and RGB8 support, which doesn't benefit from SIMD.
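As an illustration of what a SIMD framebuffer write can look like, here is a minimal sketch that packs one float RGBA color into a 32-bit pixel with SSE2, with a scalar fallback. The function name `pack_rgba8_sketch` is hypothetical and this is not the code from the PR; the byte order (R in the lowest byte) is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
    #include <emmintrin.h>
#endif

// Hypothetical helper: pack a float RGBA color in [0, 1] into one 32-bit
// pixel, R in the lowest byte. Out-of-range channels are clamped.
static inline uint32_t pack_rgba8_sketch(const float color[4])
{
    assert(color != NULL);
#if defined(__SSE2__)
    // Note: MSVC does not define __SSE2__, so it would take the scalar path here
    __m128 v = _mm_loadu_ps(color);
    v = _mm_min_ps(_mm_max_ps(v, _mm_setzero_ps()), _mm_set1_ps(1.0f));
    v = _mm_mul_ps(v, _mm_set1_ps(255.0f));
    __m128i i32 = _mm_cvtps_epi32(v);        // 4 x float -> 4 x int32 (rounded)
    __m128i i16 = _mm_packs_epi32(i32, i32); // int32 -> int16 with saturation
    __m128i i8 = _mm_packus_epi16(i16, i16); // int16 -> uint8 with saturation
    return (uint32_t)_mm_cvtsi128_si32(i8);  // low 4 bytes hold R, G, B, A
#else
    uint32_t out = 0;
    for (int i = 0; i < 4; i++)
    {
        float c = color[i];
        if (c < 0.0f) c = 0.0f;
        if (c > 1.0f) c = 1.0f;
        out |= (uint32_t)(c*255.0f + 0.5f) << (8*i);
    }
    return out;
#endif
}
```

All four channels are converted with a handful of instructions and a single 32-bit store, which is the kind of saving a packed RGB16/RGB8 target cannot get as easily.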

There's also ARM NEON support, which is normally conceptually identical, but it needs to be tested. It's normally fine, unless I missed something.

With SSE2, I went from 950 to 1450 bunnies, and with SSE41 up to 1530 bunnies (before dropping below 60).

It could be even more efficient if we could process multiple pixels at once, but it gets technical with scanlines.

As for the rest, the blend mode could be rewritten quite easily in SIMD, but I think it's better to leave that to the compiler. The code is very simple here, but I could always run a test if necessary.

There might be something to do for bilinear filtering, I'll think about it.

I also have a SIMD matrix multiplication implementation aside, and other small things, I can add them, it's still a detail but it can help.

(sorry for being long in my messages, I try not to leave anything out if anyone wants to help)

@raysan5
Owner

raysan5 commented Oct 24, 2025

I added SIMD framebuffer read/write and texel fetch (SSE2/SSE41), and I also switched the framebuffer to RGBA32 by default. There's still packed RGB16 and RGB8 support, which doesn't benefit from SIMD.

Wow! Big improvement! Tested it on my system and now I'm getting ~700 bunnies @ 60 FPS and ~1500 bunnies @ 30 FPS! Amazing! Many 2D games (and tool UIs) do not need that many sprites on screen!

It could be even more efficient if we could process multiple pixels at once, but it gets technical with scanlines.

No worries, I think those latest improvements are already a huge boost!

I also have a SIMD matrix multiplication implementation aside, and other small things, I can add them, it's still a detail but it can help.

Feel free, whenever you want.

Thanks for the detailed explanations, impressive to see how everything improved! In the following days I'll do some comparisons with the official Mesa implementations (softpipe and llvmpipe) on my RISC-V Orange Pi RV2 board, to see how it compares. I'll report the results in the open discussion issue.

I think we can merge current changes, let me know when ready.

@Bigfoot71
Contributor Author

Bigfoot71 commented Oct 24, 2025

I made a few last adjustments to the code, including one that disables sampling if the texture is '0'
Also, the blend mode is disabled if we have GL_ONE, GL_ZERO as factors, it's a bit silly, but hey

We could also check, when loading textures, whether at least one alpha value is lower than 255; if none is, and all vertex colors are fully opaque during rendering, we could disable blending when it's an alpha blend.

But that might be going a bit too far, in cases like that just changing the pipeline state costs absolutely nothing anyway.
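The two checks discussed in this comment could be sketched as follows. Both function names are hypothetical and this is not the merged code; the constants only mirror the numeric values of `GL_ZERO` (0) and `GL_ONE` (1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Same numeric values as GL_ZERO and GL_ONE
enum { BLEND_ZERO_SKETCH = 0, BLEND_ONE_SKETCH = 1 };

// With factors (ONE, ZERO) the result is src*1 + dst*0 = src,
// so the blend stage can be skipped entirely.
static bool blend_is_noop_sketch(int srcFactor, int dstFactor)
{
    return (srcFactor == BLEND_ONE_SKETCH) && (dstFactor == BLEND_ZERO_SKETCH);
}

// Scan RGBA8 texels once at load time; returns true if any alpha is below 255
static bool has_translucent_texel_sketch(const uint8_t *rgba, size_t texelCount)
{
    assert(rgba != NULL || texelCount == 0);

    for (size_t i = 0; i < texelCount; i++)
    {
        if (rgba[4*i + 3] < 255) return true;
    }
    return false;
}
```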

Note that the default texture is kept at '0' because it is used in the case where a texture is generated but an image is not "uploaded" to it.


You can merge, I think it's good to go. However, if there are better ideas regarding textures, the discussion could still be relevant.

There will still be some minor issues with the rasterizer where the triangles jitter a bit. It shouldn't be too hard to fix, but it requires carefully reviewing the logic, I'll try to look into that later.

@Bigfoot71 Bigfoot71 marked this pull request as ready for review October 24, 2025 16:13
@raysan5 raysan5 merged commit 39242db into raysan5:master Oct 24, 2025
16 checks passed
@raysan5
Owner

raysan5 commented Oct 24, 2025

@Bigfoot71 Thanks! Great improvements! Merged!
