Feature request: offline CPU handling #873

Open
mjtrangoni opened this issue Mar 29, 2018 · 8 comments · May be fixed by #3032

Comments

@mjtrangoni
Contributor

mjtrangoni commented Mar 29, 2018

Host operating system: output of uname -a

Linux xxxx 3.10.0-693.2.2.el7.ppc64le #1 SMP Sat Sep 9 03:58:38 EDT 2017 ppc64le ppc64le ppc64le GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0-rc.0 (branch: build, revision: 8ec35dfcd0aaa05b6039fc3c4bef7a675d419f6b)
  go version:       go1.10

node_exporter command line flags

default

Are you running node_exporter in Docker?

no

What did you do that produced an error?

none

What did you expect to see?

This PPC server has SMT=2 (Simultaneous multithreading) which can scale on-the-fly up to 8x.

# ppc64_cpu --smt
SMT=2
# ppc64_cpu --info
Core   0:    0*    1*    2     3     4     5     6     7
Core   1:    8*    9*   10    11    12    13    14    15
Core   2:   16*   17*   18    19    20    21    22    23
Core   3:   24*   25*   26    27    28    29    30    31
Core   4:   32*   33*   34    35    36    37    38    39
Core   5:   40*   41*   42    43    44    45    46    47
Core   6:   48*   49*   50    51    52    53    54    55
Core   7:   56*   57*   58    59    60    61    62    63
Core   8:   64*   65*   66    67    68    69    70    71
Core   9:   72*   73*   74    75    76    77    78    79
Core  10:   80*   81*   82    83    84    85    86    87
Core  11:   88*   89*   90    91    92    93    94    95
Core  12:   96*   97*   98    99   100   101   102   103
Core  13:  104*  105*  106   107   108   109   110   111
Core  14:  112*  113*  114   115   116   117   118   119
Core  15:  120*  121*  122   123   124   125   126   127
Core  16:  128*  129*  130   131   132   133   134   135
Core  17:  136*  137*  138   139   140   141   142   143
Core  18:  144*  145*  146   147   148   149   150   151
Core  19:  152*  153*  154   155   156   157   158   159

# lscpu
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0,1,8,9,16,17,24,25,32,33,40,41,48,49,56,57,64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121,128,129,136,137,144,145,152,153
Off-line CPU(s) list:  2-7,10-15,18-23,26-31,34-39,42-47,50-55,58-63,66-71,74-79,82-87,90-95,98-103,106-111,114-119,122-127,130-135,138-143,146-151,154-159
Thread(s) per core:    2
Core(s) per socket:    5
Socket(s):             4
NUMA node(s):          4
Model:                 2.1 (pvr 004b 0201)
Model name:            POWER8E (raw), altivec supported
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0,1,8,9,16,17,24,25,32,33
NUMA node1 CPU(s):     40,41,48,49,56,57,64,65,72,73
NUMA node16 CPU(s):    80,81,88,89,96,97,104,105,112,113
NUMA node17 CPU(s):    120,121,128,129,136,137,144,145,152,153

# ppc64_cpu --smt=8
# ppc64_cpu --info
Core   0:    0*    1*    2*    3*    4*    5*    6*    7*
Core   1:    8*    9*   10*   11*   12*   13*   14*   15*
Core   2:   16*   17*   18*   19*   20*   21*   22*   23*
Core   3:   24*   25*   26*   27*   28*   29*   30*   31*
Core   4:   32*   33*   34*   35*   36*   37*   38*   39*
Core   5:   40*   41*   42*   43*   44*   45*   46*   47*
Core   6:   48*   49*   50*   51*   52*   53*   54*   55*
Core   7:   56*   57*   58*   59*   60*   61*   62*   63*
Core   8:   64*   65*   66*   67*   68*   69*   70*   71*
Core   9:   72*   73*   74*   75*   76*   77*   78*   79*
Core  10:   80*   81*   82*   83*   84*   85*   86*   87*
Core  11:   88*   89*   90*   91*   92*   93*   94*   95*
Core  12:   96*   97*   98*   99*  100*  101*  102*  103*
Core  13:  104*  105*  106*  107*  108*  109*  110*  111*
Core  14:  112*  113*  114*  115*  116*  117*  118*  119*
Core  15:  120*  121*  122*  123*  124*  125*  126*  127*
Core  16:  128*  129*  130*  131*  132*  133*  134*  135*
Core  17:  136*  137*  138*  139*  140*  141*  142*  143*
Core  18:  144*  145*  146*  147*  148*  149*  150*  151*
Core  19:  152*  153*  154*  155*  156*  157*  158*  159*
# lscpu 
Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    8
Core(s) per socket:    5
Socket(s):             4
NUMA node(s):          4
Model:                 2.1 (pvr 004b 0201)
Model name:            POWER8E (raw), altivec supported
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-39
NUMA node1 CPU(s):     40-79
NUMA node16 CPU(s):    80-119
NUMA node17 CPU(s):    120-159

In the 'SMT=2' case there are 960 metrics we could ignore (4 sockets * 5 cores * 6 (8-2) threads * 8 modes).

# curl -s localhost:9100/metrics | egrep -w -v -e '(HELP|TYPE)' | grep node_cpu_seconds_total | wc -l
1280

My feature request is to reduce the number of CPU metrics. Two alternatives come to mind:

  1. Ignoring the offline CPUs in the node_exporter
  2. Introducing a new label, online="0|1", and filtering during Prometheus scrape process.

What did you want to see instead?

# curl -s localhost:9100/metrics | egrep -w -v -e '(HELP|TYPE)' | grep node_cpu_seconds_total | grep 'online=1 ' | wc -l
320
@SuperQ
Member

SuperQ commented Mar 29, 2018

Interesting, we get CPU metrics from /proc/stat. Is there an online/offline status file somewhere in /proc or /sys we can get this information from? I don't have access to any hardware like this to investigate the options.

@mjtrangoni
Contributor Author

Hi @SuperQ, you can check it like this:

# grep . /sys/devices/system/cpu/cpu*/online|head
/sys/devices/system/cpu/cpu0/online:1
/sys/devices/system/cpu/cpu100/online:0
/sys/devices/system/cpu/cpu101/online:0
/sys/devices/system/cpu/cpu102/online:0
/sys/devices/system/cpu/cpu103/online:0
/sys/devices/system/cpu/cpu104/online:1
/sys/devices/system/cpu/cpu105/online:1
/sys/devices/system/cpu/cpu106/online:0
/sys/devices/system/cpu/cpu107/online:0
/sys/devices/system/cpu/cpu108/online:0

@brian-brazil
Contributor

Ignoring the offline CPUs in the node_exporter
Introducing a new label, online="0|1", and filtering during Prometheus scrape process.

Neither of these makes sense semantically; the information for each CPU needs to be either always there or always not there.
We could expose information about which/how many cpus are online.

@knweiss
Contributor

knweiss commented Mar 29, 2018

Note that, for example, a broken CPU cache can also trigger the offlining of a CPU at runtime.

That is, the ability to count the number of online and offline CPUs would also be useful for alerting in this case. Here's an example (x86_64!):

Mar 24 06:24:42 i Threshold based error status: yellow
Mar 24 06:24:42 i mcelog: Large number of corrected cache errors. System operating, but might lead
Mar 24 06:24:42 i mcelog: to uncorrected errors soon
Mar 24 06:24:42 i mcelog: MCA: Data CACHE Level-1 Data-Read Error
Mar 24 06:24:42 i mcelog: CPU 22 on socket 1 has large number of corrected cache errors in Level-1 Data
Mar 24 06:24:42 i mcelog: System operating correctly, but might lead to uncorrected cache errors soon
Mar 24 06:24:42 i mcelog: Running trigger `cache-error-trigger'
Mar 24 06:24:42 i mcelog: STATUS 8c40004000100135 MCGSTATUS 0
Mar 24 06:24:42 i mcelog: MCGCAP f000c14 APICID 38 SOCKETID 1
Mar 24 06:24:42 i mcelog: PPIN 8c20004000101151
Mar 24 06:24:42 i mcelog: CPUID Vendor Intel Family 6 Model 85
Mar 24 06:24:42 i mcelog: Offlining CPU 22 due to cache error threshold
Mar 24 06:24:42 i kernel: intel_pstate CPU 22 exiting
Mar 24 06:24:42 i kernel: smpboot: CPU 22 is now offline
Mar 24 06:24:42 i mcelog: Offlining CPU 46 due to cache error threshold

On CentOS the script that offlines a CPU in this case can be found here:

# cat /etc/mcelog/triggers/cache-error-trigger
#!/bin/bash
# cache error trigger. This shell script is executed by mcelog in daemon mode
# when a CPU reports excessive corrected cache errors. This could be a indication
# for future uncorrected errors.
[...]
for i in $AFFECTED_CPUS ; do
        logger -s -p daemon.crit -t mcelog "Offlining CPU $i due to cache error threshold"
        F=$(printf "/sys/devices/system/cpu/cpu%d/online" $i)
        echo 0 > $F
[...]
done
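To sketch the alerting use case: assuming a hypothetical per-CPU gauge `node_cpu_online` (1 = online, 0 = offline), a Prometheus alerting rule could look roughly like this. The metric name and rule are illustrative, not an existing exporter feature.

```yaml
groups:
  - name: cpu-offline
    rules:
      - alert: CPUOffline
        # node_cpu_online is a hypothetical gauge: 1 = online, 0 = offline.
        expr: count by (instance) (node_cpu_online == 0) > 0
        for: 5m
        annotations:
          summary: "{{ $value }} CPU(s) offline on {{ $labels.instance }}"
```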

@SuperQ
Member

SuperQ commented Mar 29, 2018

Yes, I think a separate bool metric is the right thing to do here.

node_cpu_online{cpu="x"}

Looking at some of my systems, none of them have an online indicator. This seems to be somewhat platform dependent.

What do we want to do if the CPU is offline? Should we stop exposing /proc/stat counters for these CPUs? It seems reasonable to me, if we have the new bool metric.

@brian-brazil
Contributor

What do we want to do if the CPU is offline? Should we stop exposing /proc/stat counters for these CPUs? It seems reasonable to me, if we have the new bool metric.

That can cause problems with rates; they should stay exposed. Constant time series are cheap to store anyway.

@mjtrangoni
Contributor Author

Hi @SuperQ @brian-brazil ,

I did some research on this topic and read the Linux documentation here. See:

Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files.  The internal
source for the output is in brackets ("[]").

    =========== ==========================================================
    kernel_max: the maximum CPU index allowed by the kernel configuration.
		[NR_CPUS-1]

    offline:	CPUs that are not online because they have been
		HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
		of CPUs allowed by the kernel configuration (kernel_max
		above). [~cpu_online_mask + cpus >= NR_CPUS]

    online:	CPUs that are online and being scheduled [cpu_online_mask]

    possible:	CPUs that have been allocated resources and can be
		brought online if they are present. [cpu_possible_mask]

    present:	CPUs that have been identified as being present in the
		system. [cpu_present_mask]
=========== ==========================================================

I think we could iterate over /sys/devices/system/cpu/present, mark as 1 all CPUs that also appear in /sys/devices/system/cpu/online, and expose node_cpu_online{cpu="x"}="0|1".

You can see this in more detail here.

  1. Not all hardware exposes cpu0.
  2. offline and possible are a bit tricky; see AMD EPYC and Intel Skylake. They are not always present, and they reflect the architecture's maximum number of CPUs.
  3. This has only been checked on CentOS 6.9 and CentOS 7.4.

Another approach, aimed at keeping the metric count down, would be something like this:

node_cpu_online_count=XX
node_cpu_present_count=XX

I will open a PR for this soon if you agree.
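For reference, a self-contained sketch (a hypothetical helper, not the eventual PR) of parsing the kernel's CPU-list format used by the present and online files, and deriving the proposed per-CPU gauge from it:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands the kernel's CPU list format, e.g. "0-3,8,10-11",
// as found in /sys/devices/system/cpu/online and .../present.
func parseCPUList(s string) ([]int, error) {
	var cpus []int
	s = strings.TrimSpace(s)
	if s == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, err
			}
		}
		for i := start; i <= end; i++ {
			cpus = append(cpus, i)
		}
	}
	return cpus, nil
}

func main() {
	// Illustrative inputs; on a real system these would be read from sysfs.
	present, _ := parseCPUList("0-9")
	online, _ := parseCPUList("0,1,8,9")

	onlineSet := make(map[int]bool)
	for _, c := range online {
		onlineSet[c] = true
	}
	// Emit the proposed gauge: 1 for online CPUs, 0 for present-but-offline.
	for _, c := range present {
		v := 0
		if onlineSet[c] {
			v = 1
		}
		fmt.Printf("node_cpu_online{cpu=%q} %d\n", strconv.Itoa(c), v)
	}
}
```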

@discordianfish
Member

Agree that a separate node_cpu_online metric makes sense, and that we should keep exposing the stale values for offline CPUs, as @brian-brazil suggested.

rexagod added a commit to rexagod/procfs that referenced this issue May 30, 2024
@rexagod rexagod linked a pull request May 30, 2024 that will close this issue
SuperQ pushed a commit to prometheus/procfs that referenced this issue Jun 3, 2024