cpu: Implement steal/guest/guest_nice counters #2

killerbees19 · 2020-05-08T14:34:56Z

I think it's time to implement these counters after so many years! ;-)

I've tried to maintain backwards-compatibility. This should not break old kernel versions.

Indeed, it's only useful at VM guests or hypervisors…

steal (since Linux 2.6.11)
       (8) Stolen time, which is the time spent in
       other operating systems when running in a virtu‐
       alized environment

guest (since Linux 2.6.24)
       (9) Time spent running a virtual CPU for guest
       operating systems under the control of the Linux
       kernel.

guest_nice (since Linux 2.6.33)
       (10) Time spent running a niced guest (virtual
       CPU for guest operating systems under the con‐
       trol of the Linux kernel).

Successfully tested at OpenWrt 19.07.2 (x86_64 @ KVM)

kvisle · 2020-05-08T19:37:49Z

Be aware that the user time counter is actually including guest time, so in the event where you actually use the guest counter for something in the same graph - you should subtract the guest value from the user value as well.

killerbees19 · 2020-05-09T11:28:56Z

@kvisle Afaik the real cpu plugin is handling it the same way like my PR?

So for the graph it should be fine. (Example graph from Debian Stretch with munin-plugins-core.)

Last graph from Muninlite with my PR.

But you're right, this could eventually be changed to draw a better graph. For the moment it's just to fill it up to 100%.

sumpfralle · 2020-05-17T17:11:09Z

Be aware that the user time counter is actually including guest time, so in the event where you actually use the guest counter for something in the same graph - you should subtract the guest value from the user value as well.

@kvisle: thank you for raising this point. But I failed to find such a connection in man proc. Do you have another source for this relationship? Maybe it is even the reason for Debian#767100?

@killerbees19: your change proposal looks good to me. Indeed it is very useful when running OpenWrt in a VM. I will merge it, after we finished the cpu/guest discussion.

kvisle · 2020-05-17T18:39:39Z

@kvisle: thank you for raising this point. But I failed to find such a connection in man proc. Do you have another source for this relationship? Maybe it is even the reason for Debian#767100?

That relationship is probably best documented in the source code: https://github.com/torvalds/linux/blob/master/kernel/sched/cputime.c#L139-L156

Regarding 767100, I did some experiments with looking at these counters under different kinds of load - and this is counter-behavior. This is more visible under higher polling resolutions, and under higher load.

sumpfralle · 2020-08-22T18:49:10Z

Sorry for letting this rot for so long ...

I am about to publish another release of muninlite - thus let us wrap this disucssion up before.

Be aware that the user time counter is actually including guest time, so in the event where you actually use the guest counter for something in the same graph - you should subtract the guest value from the user value as well.

@kvisle: after taking a look at the example graphs provided by @killerbees19, I fail to understand this connection: the graphs look fine - everything adds up to multiple of 100%. Maybe I am missing something obvious.

But nevertheless @killerbees19 seems to agree with @kvisle's point:

But you're right, this could eventually be changed to draw a better graph. For the moment it's just to fill it up to 100%.

I am confused - your help is appreciated :)

Given my partial understanding, I see two approaches:

A) There is a problem with the implementation of the cpu plugin in munin core and with this specific pull request (cpu: Implement steal/guest/guest_nice counters #2). In this case I would suggest, that we fix this here for this pull request and I will apply a comparable change to the cpu plugin in munin core.
B) This pull request is correct. In this case I will just merge it.

Please enlighten me!

sumpfralle · 2020-10-07T16:41:00Z

@killerbees19: ping?

killerbees19 · 2020-10-08T04:38:52Z

@killerbees19: ping?

Oops, sorry! I missed this one. I'll have a look at it again, but currently I'm a little bit busy. I'm not 100% sure what's the correct answer.

kimheino · 2021-04-08T07:49:45Z

@killerbees19: ping?

killerbees19 · 2021-04-09T01:06:29Z

Thanks @kimheino for bringing this to my attention again. I'm sorry for the long delay! 🥺

Test script

Command (originally it's a one-liner) to watch /proc/stat values growing:

t=-2; last=;
while true; do
  t=$((t + 1));

  if [ $((t % 10)) -eq 0 ]; then
    echo
    echo "   us   ni   sy   id   wa   hi   si   st   gu   gn";
  fi;

  current=$(grep '^cpu ' /proc/stat | cut -c6-);

  if [ ! -z "$last" ]; then
    for i in `seq 1 10`; do
      a=$(echo "$last" | cut -d ' ' -f $i);
      b=$(echo "$current" | cut -d ' ' -f $i);
      echo -n "$(printf '%5d' "$((b - a))")";
    done;
    echo;
  fi;

  last=$current;
  sleep 1;
done

Test case

Ok, here's a dual core KVM host with low load. After 6 lines I've fired up yes > /dev/null at a single core KVM guest:

   us   ni   sy   id   wa   hi   si   st   gu   gn
   14    0   14  187    0    0    0    0    9    0
   14    0   14  191    0    0    0    0    9    0
   10    0   15  188    0    0    0    0    6    0
   11    0    9  191    1    0    0    0    6    0
   10    0   15  188    0    0    0    0    6    0
   11    0   12  190    0    0    1    0    7    0
   84    0   10  124    0    0    0    0   76    0
  119    0   11   90    0    0    0    0  115    0
  120    0   10   91    0    0    0    0  114    0
  122    0   10   91    0    0    0    0  115    0

   us   ni   sy   id   wa   hi   si   st   gu   gn
  119    0   12   92    0    0    0    0  115    0
  119    0   11   90    0    0    0    0  114    0
  119    0    9   92    0    0    0    0  115    0
  120    0   11   91    0    0    0    0  114    0
  119    0    9   91    0    0    0    0  115    0
  120    0   10   92    0    0    0    0  114    0
  123    0    9   90    0    0    0    0  118    0
  121    0    9   92    0    0    0    0  115    0
  123    0   10   88    0    0    1    0  118    0
  122    0    9   90    0    0    0    0  116    0

   us   ni   sy   id   wa   hi   si   st   gu   gn
  123    0    9   90    0    0    0    0  117    0
  119    0   11   91    0    0    0    0  116    0
  121    0   13   88    0    0    1    0  116    0
  122    0   12   89    0    0    1    0  114    0
   20    0   10  186    0    0    0    0    6    0
   10    0   12  188    0    0    0    0    6    0
   11    0   14  186    0    0    1    0    5    0
    9    0   12  187    1    0    0    0    5    0
   14    0   14  185    0    0    0    0    8    0
   21    0   12  180    0    0    0    0   15    0

The last 6 lines are showing values with stopped workload at the guest.

Conclusion

I still agree with @kvisle in the point, that guest values are already included in user counter. (And guest_nice values are included in nice counter.)

We could ignore this fact and just go on. But strictly speaking the graphs are wrong! --upper-limit is the one and only barrage that stops both fields to mess everything up.

Our current drawing situation at CPU plugin(s):

\ system.draw AREA
  \ user.draw STACK
    \ nice.draw STACK
      \ idle.draw STACK
        \ iowait.draw STACK
          \ irq.draw STACK
            \ softirq.draw STACK
              \ steal.draw STACK
                \ guest.draw STACK
                  \ guest_nice.draw STACK

I think something like this would be better:

| system.draw AREA
  \
    | user.draw STACK
    | guest.draw AREA (?)
      \
        | nice.draw STACK
        | guest_nice.draw AREA (?)
          \
            | idle.draw STACK
              \
                | iowait.draw STACK
                  \ 
                    | irq.draw STACK
                      \
                        | softirq.draw STACK
                          \
                            | steal.draw STACK

guest AREA should be drawn inside user AREA
guest_nice AREA should be drawn inside nice AREA
Maybe change colors of guest/guest_nice fields...

I've no clue if this is even possible or how to change the config output to achieve this. Any ideas?

Alternate option

Just merge the PR and think about it later 😝

There's little to no difference to have perfect graphs. RAW values from all fields are valid (as someone would expect it from /proc/stat) and therefore everyone can use it without restrictions at limits or for aggregated graphs.

It's up to you.

Remark

If we'd recalculate user/nice values (by subtracting guest/guest_nice), we'd possibly break various user limit configurations. Nobody expects a new behavior (munin) or a new data field (muninlite) so warn/crit limits for the user/nice fields would break in certain situations. And that's really bad. Moreover any change would differ from expected output from /proc/stat.

Neustradamus · 2025-01-02T02:00:57Z

Any progress on this PR?

killerbees19 · 2025-01-12T21:17:00Z

It's not my turn anymore, right?

kenyon · 2025-01-12T22:22:43Z

@killerbees19 can you rebase on current master and squash the commits (or give maintainers the ability to update this pull request)? Thanks.

steal (since Linux 2.6.11) (8) Stolen time, which is the time spent in other operating systems when running in a virtu‐ alized environment guest (since Linux 2.6.24) (9) Time spent running a virtual CPU for guest operating systems under the control of the Linux kernel. guest_nice (since Linux 2.6.33) (10) Time spent running a niced guest (virtual CPU for guest operating systems under the con‐ trol of the Linux kernel).

killerbees19 · 2025-01-13T00:16:17Z

@kenyon Done

Neustradamus · 2025-01-13T03:26:36Z

@killerbees19: Good job, thanks!
@kenyon: Thanks for merging!

(Never too late)

killerbees19 marked this pull request as ready for review May 8, 2020 14:36

killerbees19 force-pushed the feature/steal-guest branch from eadf11e to 5f55dcf Compare January 13, 2025 00:15

kenyon approved these changes Jan 13, 2025

View reviewed changes

kenyon merged commit 0d7e824 into munin-monitoring:master Jan 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cpu: Implement steal/guest/guest_nice counters #2

cpu: Implement steal/guest/guest_nice counters #2

killerbees19 commented May 8, 2020 •

edited

Loading

kvisle commented May 8, 2020

killerbees19 commented May 9, 2020 •

edited

Loading

sumpfralle commented May 17, 2020

kvisle commented May 17, 2020

sumpfralle commented Aug 22, 2020

sumpfralle commented Oct 7, 2020

killerbees19 commented Oct 8, 2020

kimheino commented Apr 8, 2021

killerbees19 commented Apr 9, 2021 •

edited

Loading

Neustradamus commented Jan 2, 2025

killerbees19 commented Jan 12, 2025

kenyon commented Jan 12, 2025

killerbees19 commented Jan 13, 2025

Neustradamus commented Jan 13, 2025

cpu: Implement steal/guest/guest_nice counters #2

cpu: Implement steal/guest/guest_nice counters #2

Conversation

killerbees19 commented May 8, 2020 • edited Loading

kvisle commented May 8, 2020

killerbees19 commented May 9, 2020 • edited Loading

sumpfralle commented May 17, 2020

kvisle commented May 17, 2020

sumpfralle commented Aug 22, 2020

sumpfralle commented Oct 7, 2020

killerbees19 commented Oct 8, 2020

kimheino commented Apr 8, 2021

killerbees19 commented Apr 9, 2021 • edited Loading

Test script

Test case

Conclusion

Alternate option

Remark

Neustradamus commented Jan 2, 2025

killerbees19 commented Jan 12, 2025

kenyon commented Jan 12, 2025

killerbees19 commented Jan 13, 2025

Neustradamus commented Jan 13, 2025

killerbees19 commented May 8, 2020 •

edited

Loading

killerbees19 commented May 9, 2020 •

edited

Loading

killerbees19 commented Apr 9, 2021 •

edited

Loading