Perf improvements #237

Open
viblo opened this issue Nov 5, 2023 · 5 comments

viblo commented Nov 5, 2023

This is not really an issue, more of an FYI and a question about whether someone else has looked into these things. It could maybe fit on the forum, but I feel it's more visible to put it here.

Anyway, just a couple of days ago I added a batch API to Pymunk (a Python 2D physics library built on Chipmunk) to get some data out more quickly, since it's quite expensive to call C code from Python. In my simple test case I was mainly bottlenecked by Chipmunk performance rather than Python, which is the opposite of the usual situation. So I started looking into whether I could increase the performance of Chipmunk (on desktop). This is a short report so far:

I did all tests on Windows 11 in WSL (Ubuntu) on my ThinkPad X1 gen 7 laptop with an i5-8265U CPU (~Skylake). To compare performance I used bench.c, shortened to 1000 steps.

I tried to reorder the structs:

  1. Roughly half the total time is spent in cpArbiterApplyImpulse when I run the bench.c demo through perf record.
  2. Using pahole I could see that neither the cpArbiter struct nor the cpBody struct is laid out in a cache-line-friendly way for how they are used in the apply-impulse function.
  3. As a quick and easy test of whether making them more cache friendly could help, I reordered the struct fields to put all the fields used in cpArbiterApplyImpulse first in those two structs (see the sketch below).
  4. This resulted in a 4.5% saving of total time on the benchmarks (averaged over 6 runs)!
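
For illustration, this is roughly the kind of reordering I mean. It is a simplified sketch, not the actual cpBody definition (the real struct has more fields), with the fields I believe the solver touches grouped first:

```c
/* Simplified sketch of the idea, not the real cpBody layout. Fields that
 * cpArbiterApplyImpulse reads and writes on every solver iteration are
 * grouped at the front so they land on as few cache lines as possible. */
#include <chipmunk/chipmunk.h>

typedef struct ExampleBody {
    /* Hot: touched in the impulse-solver inner loop. */
    cpVect  v;       /* linear velocity */
    cpFloat w;       /* angular velocity */
    cpVect  v_bias;  /* position-correction ("pseudo") velocity */
    cpFloat w_bias;
    cpFloat m_inv;   /* inverse mass */
    cpFloat i_inv;   /* inverse moment of inertia */

    /* Cold: only needed outside the solver loop. */
    cpVect        p, f;  /* position, force */
    cpFloat       a, t;  /* angle, torque */
    cpDataPointer userData;
    /* ... shape/arbiter/constraint lists, sleeping state, etc. ... */
} ExampleBody;
```

With 8-byte cpFloat the hot group above is exactly 64 bytes (two cpVects plus four cpFloats), so it can fit in a single cache line if the struct is aligned to one.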

I also tried compiling with -march=skylake. I'm not sure how I would use this in a real release of Pymunk, but it was worth testing at least. It saved another 5% of the remaining time, for a total saving of 9%.
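
As a side note on how such a flag could be used without shipping a Skylake-only binary, here is a small sketch (my own illustration, nothing from Chipmunk or Pymunk) of branching on the compiler's feature macros at build time:

```c
/* Illustration only: building with -march=skylake (or -mavx2) makes GCC/Clang
 * define feature-test macros that code can branch on at compile time. */
#include <stdio.h>

static const char *simd_level(void)
{
#if defined(__AVX2__)
    return "AVX2 (e.g. -march=skylake or -mavx2)";
#elif defined(__SSE2__)
    return "SSE2 only (the x86-64 baseline)";
#else
    return "no x86 SIMD extensions";
#endif
}

int main(void)
{
    printf("compiled with: %s\n", simd_level());
    return 0;
}
```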

These two things were the easiest I could think of to try (after I had researched a bit how easy SIMD for x86 would be).

Some other things I thought about to put all the needed data closer together in memory (whether they help or not I don't know yet; a couple of them are sketched after the list):

  1. Inline the cpContact struct into cpArbiter
  2. Separate arbiters with 1 and 2 contacts
  3. Read the body fields out into the arbiter on collision, and then use those instead of going back to the body in the apply-impulse function
  4. Collect the resulting velocities in a separate array that is written back to bodies afterwards
  5. Reorder things in cpArbiterApplyImpulse
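
A rough sketch of what I mean with ideas 1 and 4, using hypothetical structs rather than Chipmunk's actual cpArbiter/cpContact definitions:

```c
/* Hypothetical sketch of ideas 1 and 4 above, not actual Chipmunk code. */
#include <chipmunk/chipmunk.h>

#define EXAMPLE_MAX_CONTACTS 2  /* an arbiter holds either 1 or 2 contacts */

typedef struct ExampleContact {
    cpVect  r1, r2;        /* contact points relative to each body          */
    cpFloat nMass, tMass;  /* effective masses along the normal and tangent */
    cpFloat bounce, bias;
    cpFloat jnAcc, jtAcc;  /* accumulated normal and tangent impulses       */
} ExampleContact;

typedef struct ExampleArbiter {
    cpBody *body_a, *body_b;
    cpVect  n;      /* shared contact normal */
    int     count;  /* 1 or 2 */
    /* Idea 1: contacts stored inline instead of behind a pointer, so the
     * solver never has to chase a separate allocation. */
    ExampleContact contacts[EXAMPLE_MAX_CONTACTS];
} ExampleArbiter;

/* Idea 4: accumulate solver output into a compact scratch array and write it
 * back to the bodies once per step, after the solver loop has finished. */
typedef struct ExampleVelocity {
    cpVect  v;
    cpFloat w;
} ExampleVelocity;
```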

I should note that I had zero experience optimizing C code before this. Actually, I have almost zero experience writing C code at all.

Any input is welcome!

@slembcke (Owner)

Hrm. So I've known for a while that a number of my early decisions in Chipmunk were fairly sub-optimal. In the past I did experiments where I rewrote large sections of Chipmunk to use SoA-ordered data, and that helped a lot. The problem is that a lot of those structs are part of the public API and I can't really change them now. I'd never really considered the impact of just reordering fields though... I'm kind of surprised that inlining the contacts into the arbiter would help, as the contacts are already packed linearly in memory in the order they are accessed. I've considered Chipmunk to be "stable" for quite a few years now, in the sense that I don't really have any big or breaking changes to make. I'm not sure how I feel about this, but I could maybe be convinced. I certainly wouldn't mind if you made a "turbo" fork or something.

I'm currently making a new game, and I actually wrote a new (but very, very simple) physics engine for it to finally try out some new ideas. It's vaguely ECS-based, and it heavily uses SoA. I knew it would be faster, but I was shocked to find it running several times faster (though it's hard to make a very direct comparison): https://github.com/slembcke/veridian-expanse/blob/master/src/drift_physics.c Anyway... not really relevant, but it's interesting to think about where Chipmunk could go in the future.
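
To make the AoS-vs-SoA distinction concrete, here's a rough illustration (not code from Chipmunk or Veridian Expanse):

```c
/* Rough illustration of AoS vs SoA; not code from Chipmunk or Veridian Expanse. */
#include <chipmunk/chipmunk.h>

/* Array of structures: each body's fields sit together, so a pass that only
 * needs velocities still drags the whole struct through the cache. */
typedef struct BodyAoS {
    cpVect  p, v, f;
    cpFloat a, w, t, m_inv, i_inv;
} BodyAoS;

/* Structure of arrays: each field lives in its own densely packed array, so a
 * velocity-only pass touches just the data it needs and vectorizes easily. */
typedef struct BodiesSoA {
    int      count;
    cpVect  *p, *v, *f;
    cpFloat *a, *w, *t, *m_inv, *i_inv;
} BodiesSoA;
```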

viblo (Author) commented Nov 15, 2023

I was also surprised that I got a clear effect from my very limited changes, especially for just rearranging fields, which is easy and safe to do (unless you care about binary compatibility, of course). I can see your desire to keep it stable. I had some users complain when moving from Pymunk 4.x to 5.0, which broke several things, and it's a difficult tradeoff. At least it's better that it stays stable than that things change for the worse or more bugs are introduced.

One thing Pymunk is used for is different kinds of research (maybe this is even the most common usage). It's a quick and easy way to simulate an environment and then try different things (for example reinforcement learning, motion prediction and a lot of other stuff).

Some years ago I had a discussion about the future of Pymunk with the people behind a toolkit for developing AI algorithms (i.e. simulating an environment for a robot and similar things). They were using a Python library built on top of Box2D, which was not maintained.
One thing I think they cared about was simulation speed, which becomes extra important if you want to re-run the same simulation many, many times to train an algorithm. In the end I think they went a different path and use something else that has more big-company support and can be accelerated on a GPU. Anyway, my point is that for (Python) games, for example, I think Pymunk/Chipmunk is often fast enough and other things will be the limit. But for these other cases I think there's lots of room for faster simulation speed.

Now, I only do Pymunk development sometimes in the evenings, so whether I will really be able to make some big, breaking performance improvements is still open. If nothing else I will make a PR with the rearranged structs, but it needs some more testing first to see what was actually useful.

viblo (Author) commented Nov 15, 2023

Btw, have you looked at GPU acceleration of Chipmunk? Or maybe it's not really relevant for games, when the GPU is busy doing graphics anyway?

@slembcke (Owner)

Yeah... there's a lot I want to do to modernize Chipmunk. At this point it's 17 years old! There are a lot of data changes I could make to vastly improve performance, and the whole API itself could use a pretty big "modern C" upgrade. On the other hand I have 2 big projects for work and a host of other hobby projects. :( I just don't see myself being able to pull that off without making something much simpler and more focused, like the physics I made for Veridian Expanse. On the other (other) hand, Erin Catto has been working towards Box2D 3.0, and it's absolutely a wishlist of everything I would put in a Chipmunk refresh: a modern C API (not C++!), a heavy multi-threading focus from the start, object handles instead of direct pointers, etc. It sounds pretty great.

Pymunk for research: That reminds me! Not sure if I shared this story with you before. A few years back I helped mentor high school students for the FIRST Robotics competition. One of those students is now getting his PhD in robotics actually. I ended up sitting next to him at a LAN party of all places and we talked a lot about game tech. He didn't realize that I worked in games, and I mentioned a bunch of our projects including Chipmunk. Apparently he used Pymunk in his PhD research to do his initial simulations to develop the control systems they were working on. Small world! :) I don't think I've said this in a while, but it's awesome that you made Pymunk. It really seems to help a lot of people.

GPU acceleration: Not really. I've read a bit about how people have approached some of the algorithms involved, but that's about it. I think realistically Chipmunk's OO-ish API is a terrible fit for how GPU-accelerated data needs to be organized, and you'd need to copy a lot of data around. I've also done just enough related work to know that there are a lot of subtle mistakes you can make when trying to optimize at that level that can destroy performance.

viblo (Author) commented Nov 16, 2023

Yeah, the lack of time/focus is always a biggie. There are so many things that would be interesting to try! That's also why I made the post here in the first place: best to share the small things while they're fresh, even if nothing more comes of it.

Small world indeed :) As always, cool to hear about someone using Pymunk!

GPU: Ah, yeah that is what I guessed. Difficult/impossible to shoehorn into an existing design, and it can have wide-ranging side-effects.
