Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undocumented regression in temp plugin #11294

Closed
quentinmit opened this issue Jun 13, 2022 · 8 comments · Fixed by #14575
Closed

Undocumented regression in temp plugin #11294

quentinmit opened this issue Jun 13, 2022 · 8 comments · Fixed by #14575
Assignees
Labels
bug unexpected problem or unintended behavior regression something that used to work, but is now broken upstream bug or issues that rely on dependency fixes

Comments

@quentinmit
Copy link

Relevant telegraf.conf

[[inputs.temp]]

Logs from Telegraf

n/a

System info

Telegraf 1.22.4 on Debian 11

Docker

No response

Steps to reproduce

  1. Configure Telegraf with a bare temp input on a Linux system that has hwmon devices.

Expected behavior

In older versions (at least Telegraf <=1.17.3 but I haven't exhaustively tested), each temperature from /sys/class/hwmon/*/temp_* is exposed as a stream with a sensor tag derived from the filename in /sys:

sensor=coretemp_core0_critalarm
sensor=coretemp_core0_input
sensor=coretemp_core0_max
sensor=coretemp_core1_crit
sensor=coretemp_core1_critalarm
sensor=coretemp_core1_input
sensor=coretemp_core1_max
...

This is also what is documented in the temp plugin's README.

Actual behavior

In new builds (at least Telegraf >=1.22.4, sorry I have't bisected) only the /sys/class/hwmon/*/temp_*_input temperatures are exposed:

sensor=coretemp_core0
sensor=coretemp_core1
...

which means that the max/crit/etc. values are no longer being reported anywhere, nor are any sensors that don't follow the pattern of temp_*_input. It looks like this was broken when you pulled in shirou/gopsutil#905

I'm not sure what you should do to fix this now, but metrics shouldn't have undocumented regressions in their structure like this between versions. Ideally you would revert back to the old behavior because there is now missing data, in addition to the schema change. It looks like gopsutil now exposes the max and crit levels as fields that you could turn into metrics, but the other values are just dropped by gopsutil.

Additional info

No response

@quentinmit quentinmit added the bug unexpected problem or unintended behavior label Jun 13, 2022
@srebhan srebhan added the regression something that used to work, but is now broken label Jun 14, 2022
@srebhan srebhan self-assigned this Jun 14, 2022
@srebhan
Copy link
Contributor

srebhan commented Jun 14, 2022

@quentinmit can you please test the artifacts built in #11301 once it is finished!? I currently see no (non-manual) way to get back _critalarm... Would this be an issue for you?

@quentinmit
Copy link
Author

It's not just _critalarm that would still be missing; e.g. one of my systems has _alarm and _min that wouldn't be exposed. See the docs at https://www.kernel.org/doc/Documentation/hwmon/sysfs-interface - there are a ton of different possible fields (and the names are up to each driver so they're not super consistent across chipsets). Note that a few of them (e.g. *alarm) are bitmasks and not actually temperatures in degrees Celsius, so it was always a little weird that they were reported as temp measurements. I like what you've done with mapping them to fields.

Probably this needs to be fixed upstream in gopsutil to correctly bring back all the values. I'm guessing they just didn't test their change on a large enough variety of systems to discover how inconsistent the hwmon files are.

@quentinmit
Copy link
Author

quentinmit commented Jun 19, 2022

Testing the artifacts from #11301 on one of my machines, I get:

# 1.17.3:
nvme_composite_alarm temp=0
nvme_composite_crit temp=84.85
nvme_composite_input temp=49.85
nvme_composite_max temp=81.85
nvme_composite_min temp=-273.15
nvme_sensor1_input temp=49.85
nvme_sensor1_max temp=65261.85
nvme_sensor1_min temp=-273.15
nvme_sensor2_input temp=54.85
nvme_sensor2_max temp=65261.85
nvme_sensor2_min temp=-273.15
k10temp_tctl_input temp=51.375
k10temp_tdie_input temp=51.375
# 1.23.0:
nvme_composite temp=49.85
nvme_sensor_1 temp=49.85
nvme_sensor_2 temp=54.85
k10temp_tctl temp=50.5
k10temp_tdie temp=50.5
# #11301
nvme_composite_crit temp=84.85
nvme_composite_max temp=81.85
nvme_composite_input temp=49.85
nvme_sensor_1_crit temp=0
nvme_sensor_1_max temp=65261.85
nvme_sensor_1_input temp=49.85
nvme_sensor_2_crit temp=0
nvme_sensor_2_max temp=65261.85
nvme_sensor_2_input temp=54.85
k10temp_tctl_crit temp=0
k10temp_tctl_max temp=0
k10temp_tctl_input temp=50
k10temp_tdie_crit temp=0
k10temp_tdie_max temp=0
k10temp_tdie_input temp=50

As expected, _alarm and _min are still missing.

You're reporting new max and crit streams with a value of 0 for k10temp which isn't supposed to have them. (But 0 is also an unlikely but valid value for max or crit, so you can't just ignore a value of 0... this is a painfully broken API.)

Also, it looks like the naming is subtly different (nvme_sensor_1_input vs nvme_sensor1_input).

And finally, this is specific to Linux... I don't have a Windows machine handy but I would imagine their sensors don't follow this naming convention, so make sure you're not introducing a new regression on non-Linux platforms.

@srebhan
Copy link
Contributor

srebhan commented Jun 21, 2022

@quentinmit yeah I tried to work around the change in gopsutil (the one you mentioned), but that is the best we can get from them. Let me discuss in the team on how to handle the situation as the only solution I see is to manually restore the previous mechanism for Linux. :-S

@quentinmit
Copy link
Author

Yeah, I'd say figure out how to avoid reporting 0 when the limits don't exist, and otherwise that's the best you can do given the current gopsutil API. But maybe you can chase down an upstream bug to get them to fix/revert their API.

@srebhan
Copy link
Contributor

srebhan commented Jun 29, 2022

@quentinmit reported the issue upstream in shirou/gopsutil#1319.

@powersj powersj added the upstream bug or issues that rely on dependency fixes label Aug 2, 2023
@srebhan
Copy link
Contributor

srebhan commented Oct 31, 2023

Attempted fix in shirou/gopsutil#1546.

@srebhan
Copy link
Contributor

srebhan commented Jan 15, 2024

@quentinmit sorry for coming back so late to this issue. Can you please test the binary in PR #14575 once CI finished the tests!? With metric_format = "v1" you should see all previous sensors with the previous names...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug unexpected problem or unintended behavior regression something that used to work, but is now broken upstream bug or issues that rely on dependency fixes
Projects
None yet
3 participants