High Kepler CPU usage under normal workloads #1670

Open

vimalk78 opened this issue Aug 5, 2024 · 6 comments

vimalk78 commented Aug 5, 2024

Without any load on the system, Kepler's CPU usage goes up to 20%.
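
For reference, one way to observe this on an otherwise idle host (a sketch; pidstat from sysstat and the exact process name "kepler" are assumptions, not part of the original report):

# Sample Kepler's CPU usage every 5 seconds, 12 times (one minute total).
# %usr + %system in the output is the figure reported above.
pidstat -u -p "$(pgrep -x kepler)" 5 12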

vimalk78 commented Aug 5, 2024

#1660 (comment)

vimalk78 commented Aug 6, 2024

On latest main, if the machine is loaded with stress-ng, Kepler's CPU usage spikes. In comparison, the Kepler build from before the ring-buffer changes does not show an increase in CPU usage when the machine is loaded.

[asciicast recording showing the Kepler CPU usage spike under load]
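
To reproduce the comparison without the recording, something like this should work (a sketch; the process name "kepler" and the sampling approach are assumptions):

# Generate CPU load in the background, then sample Kepler's CPU% once per
# second for two minutes and watch for the spike.
stress-ng --cpu 8 --timeout 2m &
top -b -d 1 -n 120 -p "$(pgrep -x kepler)"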

vimalk78 commented Aug 6, 2024

Comparing with the old code, some Kepler CPU usage increase is understandable: some of the processing (3 map lookups, 2 updates, 1 delete) used to happen in kernel context, and the CPU cycles for it were accounted to the kernel. That processing now happens in user space and gets counted as Kepler CPU.

We need to check whether we can reduce the CPU spike in Kepler when the machine is loaded.
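
One way to sanity-check this accounting shift (a sketch, not from the thread; assumes the Kepler process is literally named "kepler"):

# utime (field 14 of /proc/PID/stat) is CPU time spent in user space, stime
# (field 15) in the kernel, both in clock ticks. Under the ring-buffer design,
# growth should shift from stime toward utime.
pid=$(pgrep -x kepler)
awk '{printf "utime=%s stime=%s (clock ticks)\n", $14, $15}' "/proc/$pid/stat"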

@dave-tucker (Collaborator)

> We need to check whether we can reduce the CPU spike in Kepler when the machine is loaded.

Exactly! I'm now able to reproduce with stress-ng, and I'm working to keep that CPU spike as low as possible.

rootfs commented Aug 6, 2024

@dave-tucker can you create a feature branch, move the code there, and revert the related commits?

vimalk78 commented Aug 7, 2024

I ran some perf stat tests to check Kepler's impact on context switching. The idea: since Kepler traps sched_switch and does some processing, it adds overhead to each switch, which should show up as fewer context switches completing in a fixed time window. stress-ng is run in parallel to simulate load.

  • without Kepler running
root@bkr18:~# sudo perf stat -a -e sched:sched_switch --timeout 600000 # with no kepler with load

 Performance counter stats for 'system wide':

        90,480,301      sched:sched_switch                                                    

     600.105927296 seconds time elapsed
  • with Kepler release-0.7.11 running
root@bkr18:~# sudo perf stat -a -e sched:sched_switch --timeout 600000 # with kepler 0.7.11 with load

 Performance counter stats for 'system wide':

        87,500,721      sched:sched_switch                                                    

     600.100293869 seconds time elapsed
  • with latest Kepler running (with ring buffer)
root@bkr18:~# sudo perf stat -a -e sched:sched_switch --timeout 600000 # with kepler latest with load

 Performance counter stats for 'system wide':

        79,620,228      sched:sched_switch                                                    

     600.099929726 seconds time elapsed

Observation: with Kepler running, the number of context switches goes down, as expected. But with the ring-buffer changes the drop is larger than with the 0.7.11 release: roughly 12% fewer switches than the no-Kepler baseline, versus roughly 3% fewer for 0.7.11.

The test was run on a bare-metal machine with almost no other load.

stress-ng command:
stress-ng --cpu 8 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 11m
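
Putting it together, each 10-minute run looked roughly like this (a sketch; the stress-ng parameters and the 600 s perf window are from above, the orchestration around them is an assumption):

# Start the load generator for slightly longer than the measurement window,
# then count sched_switch events system-wide for 600 s.
stress-ng --cpu 8 --iomix 4 --vm 2 --vm-bytes 128M --fork 4 --timeout 11m &
sudo perf stat -a -e sched:sched_switch --timeout 600000
wait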
