Skip to content
This repository has been archived by the owner on Nov 8, 2022. It is now read-only.

snap PCM task gives wrong values #38

Open
keren42 opened this issue Feb 21, 2017 · 8 comments
Open

snap PCM task gives wrong values #38

keren42 opened this issue Feb 21, 2017 · 8 comments
Labels

Comments

@keren42
Copy link

keren42 commented Feb 21, 2017

using ubuntu 16.04, xeon CPU E5-2699 v4.
snap release 1.0

When running pcm task we get the following output from task watch:

Watching Task (6e582608-b41e-4d0a-9a3a-109c259d8581):
NAMESPACE DATA TIMESTAMP
^Cntel/pcm/AFREQ -1 2017-02-21 17:02:20.411681592 +0200 IST
/intel/pcm/C0res% 0 2017-02-21 17:02:20.411688011 +0200 IST
/intel/pcm/C1res% 71 2017-02-21 17:02:20.411663986 +0200 IST
/intel/pcm/C2res% 0 2017-02-21 17:02:20.411675044 +0200 IST
/intel/pcm/C3res% 0 2017-02-21 17:02:20.411682865 +0200 IST
/intel/pcm/EXEC 0 2017-02-21 17:02:20.411660247 +0200 IST
/intel/pcm/FREQ 0 2017-02-21 17:02:20.411665382 +0200 IST
/intel/pcm/INST 0 2017-02-21 17:02:20.411662151 +0200 IST
/intel/pcm/INSTnom 0 2017-02-21 17:02:20.411666862 +0200 IST
/intel/pcm/INSTnom% 0 2017-02-21 17:02:20.411676387 +0200 IST
/intel/pcm/IPC -1 2017-02-21 17:02:20.411668292 +0200 IST
/intel/pcm/L2HIT 1 2017-02-21 17:02:20.411677697 +0200 IST
/intel/pcm/L2MISS 0 2017-02-21 17:02:20.411684095 +0200 IST
/intel/pcm/L3HIT 1 2017-02-21 17:02:20.411669659 +0200 IST
/intel/pcm/L3MISS 0 2017-02-21 17:02:20.411679016 +0200 IST
/intel/pcm/LMB 5556 2017-02-21 17:02:20.411670942 +0200 IST
/intel/pcm/PhysIPC -1 2017-02-21 17:02:20.411685391 +0200 IST
/intel/pcm/PhysIPC% -25 2017-02-21 17:02:20.411686744 +0200 IST
/intel/pcm/READ 0.0595 2017-02-21 17:02:20.411672321 +0200 IST
/intel/pcm/RMB 0 2017-02-21 17:02:20.411673628 +0200 IST
/intel/pcm/WRITE 0.0578 2017-02-21 17:02:20.411680269 +0200 IST

All metrics except READ,WRITE,LMB and C1res% remain with the above constant values,
while the pcm application's output shows that all the above metrics do have changing values which from some reason are not received from the task.
snap daemon is running in root mode and the task is in user mode.

What can be the reason for the inconsistency between the values?

Thanks.

@IRCody
Copy link

IRCody commented Feb 21, 2017

@keren42: I noticed similar behavior on my machine running ubuntu 16.04 with a xeon (though not the exact model you have). I'm not overly familiar with the plugin/pcm but Ithink that maybe @IzabellaRaulin knows more than me and can help us out.

@IzabellaRaulin
Copy link
Contributor

IzabellaRaulin commented Feb 22, 2017

hello @keren42, which version of snap-plugin-collector-pcm do you use? Is is the latest one - version 10?
Could you also provide your task manifest and your version of pcm.x? There is a notice in release notes of pcm.x that support Intel® Xeon® E5 v4 processor family was added in version 2.11+(read more here https://software.intel.com/en-us/articles/intel-performance-counter-monitor, in the end of the page)

@keren42
Copy link
Author

keren42 commented Feb 22, 2017

Hi @IzabellaRaulin , we have 3 servers, only two of them have the problem I described above. On those two we have updated to version 10 and problem remains.
On the third server (which still has version 9) that used to work fine we can see normal data flow when using task watch as before, yet now in the pcm application we see bad figures (as in the attached screenshot) preceded with the following error:
Check if you run other competing Linux perf clients.
Error while reading perf data. Result is 0

pcm_screenshot

The task manifest:

{"version":1,"schedule":{"type":"simple","interval":"5s"},"workflow":{"collect":{"metrics":{"/intel/pcm/AFREQ":{},"/intel/pcm/C0res%":{},"/intel/pcm/C1res%":{},"/intel/pcm/C2res%":{},"/intel/pcm/C3res%":{},"/intel/pcm/EXEC":{},"/intel/pcm/FREQ":{},"/intel/pcm/INST":{},"/intel/pcm/INSTnom":{},"/intel/pcm/INSTnom%":{},"/intel/pcm/IPC":{},"/intel/pcm/L2HIT":{},"/intel/pcm/L2MISS":{},"/intel/pcm/L3HIT":{},"/intel/pcm/L3MISS":{},"/intel/pcm/PhysIPC":{},"/intel/pcm/PhysIPC%":{},"/intel/pcm/READ":{},"/intel/pcm/WRITE":{},"/intel/pcm/RMB":{},"/intel/pcm/LMB":{}},"config":{},"process":null,"publish":[{"plugin_name":"rabbitmq","config":{"uri":"10.0.0.216:5672","exchange_name":"snap","routing_key":"ben1","exchange_type":"direct","durable":false}}]}}}

@IzabellaRaulin
Copy link
Contributor

IzabellaRaulin commented Feb 22, 2017

@Keren, it looks like an issue related with pcm.x. The snap-plugin-collector-pcm only executes the pcm.x command and parses the output, so it does not happen in snap's collector.

So, the question is why the pcm.x output looks like as you posted above - to investigate that more,
could you at first ensure:

  • that NMI watchdog is disabled; to do this
    • at running time: echo 0 > /proc/sys/kernel/nmi_watchdog
    • or permanently: echo 'kernel.nmi_watchdog=0' >> /etc/sysctl.conf
  • there is a access to CPUs MSRs (Model Specific Register); to do this, execute modprobe msr
  • pcm.x is run as a root

Also, could you provide:

  • the version of pcm.x which is used on each of your servers
  • information about CPU architecture/Model name for each of your server (lscpu)
    (please, add a notice to those for which of servers pcm.x output looks as expected and for which an error is reported)

@mkleina
Copy link

mkleina commented Feb 23, 2017

I can reproduce @keren42 issue (using PCM 201702 release). After some research, I was able to get readings by disabling perf support in PCM Makefile and then rebuilding PCM binaries:

# rely on Linux perf support (user needs CAP_SYS_ADMIN privileges), comment out to disable
ifneq ($(wildcard /usr/include/linux/perf_event.h),)
CXXFLAGS += -DPCM_USE_PERF
endif

try to comment out these lines.

Before:

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP

   0    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     71
   1    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     67
   2    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     70
   3    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     67
   4    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     71
   5    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     67
   6    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     70
   7    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     67
---------------------------------------------------------------------------------------------------------------
 SKT    0     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     67
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.00   -1.00   0.00    -1.00       0        0      1.00    1.00    -nan    -nan     N/A

After:

Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3MPI | L2MPI | TEMP

   0    0     0.03   0.96   0.03    0.65     196 K    424 K    0.54    0.26    0.00    0.00     73
   1    0     0.02   1.05   0.02    0.61     180 K    358 K    0.50    0.28    0.00    0.00     70
   2    0     0.01   0.69   0.02    0.61     215 K    439 K    0.51    0.17    0.00    0.01     72
   3    0     0.03   0.98   0.03    0.71     270 K    485 K    0.44    0.22    0.00    0.01     68
   4    0     0.02   1.25   0.02    0.73     119 K    222 K    0.46    0.36    0.00    0.00     73
   5    0     0.07   1.17   0.06    0.57     367 K    891 K    0.59    0.55    0.00    0.00     69
   6    0     0.02   1.41   0.01    0.70      54 K    113 K    0.52    0.41    0.00    0.00     71
   7    0     0.07   1.17   0.06    0.62     328 K    814 K    0.60    0.57    0.00    0.00     68
---------------------------------------------------------------------------------------------------------------
 SKT    0     0.03   1.09   0.03    0.63    1732 K   3750 K    0.54    0.44    0.00    0.00     68
---------------------------------------------------------------------------------------------------------------
 TOTAL  *     0.03   1.09   0.03    0.63    1732 K   3750 K    0.54    0.44    0.00    0.00     N/A

Maybe it is some issue with privileges (as mentioned in Makefile comment), but tried to change privileges using sudo setcap cap_sys_admin+ep ./pcm.x without any result.

@IzabellaRaulin
Copy link
Contributor

@Keren, any updates? Does the disabling of perf support in PCM resolve your issue?

@keren42
Copy link
Author

keren42 commented Feb 28, 2017

@IzabellaRaulin, yes, pcm task watch in all of the servers now display the data, we are not entirely sure what made the difference. Do we need to execute the modprobe msr command after each time we reset the server?
The pcm app seem to be working fine now and displays the bad output only when the pcm task is also running.

@PatrykMatyjasek
Copy link
Contributor

@keren42 you can add msr module to load at boot editing file /etc/modules on debian based OS, or /etc/modules.conf on CentOS. This module is needed to read performance counters.
PCM app displays bad output when task is running because only one application can access to registers, and read them. Plugin is also running pcm.x application, so if you run other pcm.x instance it cannot access to correct data. If you delete your task, start pcm.x application, and create new one task you will get -1 once again.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

5 participants