-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
smartmon scripts: quote in value #149
Comments
Is smartmon.py or smartmon.sh "better"? |
deletingTried this:
Not yet fully successful ... |
There is discussion in #119 to replace both existing smartmon scripts with a new Python script and parse the JSON output of |
ah, I see. But no direct solution available. I also find https://github.com/prometheus-community/smartctl_exporter ... doesn't work very well either. |
@stefangweichinger I am working on an interim refactor of the existing smartmon.py to use the official Prometheus client_python library for rendering the metrics. This will solve any quoting issues. However, since the homegrown metric printing functions in smartmon.py take some liberties with with the validity of the metrics format, it will take a little while. |
Wow, sounds good. Good luck with this. I will try to come up with some ugly workaround for those Toshiba disks (hopefully). |
@stefangweichinger Please test The original code made heavy use of generators, which was only possible due to the fact that it took some shortcuts with label consistency. When using the official Prometheus client libraries, a metric's label names need to be declared in advance. If a particular data point has no value for a certain label, it should be set to empty string. All labels are rendered by the official client libraries, even when they are empty string. However, most homegrown metric rendering code that I've seen ignores this completely, and data points often have variable label sets. Node_exporter tolerates this when reading the |
@dswarbrick thanks, will try asap |
@dswarbrick The script works fine so far, I get data/metrics. So far no graphs in the dashboard, I think I use some dashboard that uses wrong queries. For example it seems to mix up "device" and "disk". Do you have a dashboard that you know this new script should work with? Thank you! Ah, btw: that one disk is parsed like this:
So that quote is escaped inside the field. That might be OK for prometheus and Grafana, I assume. |
@stefangweichinger Sorry, I can't advise you on which dashboard is best to use. I usually make my own dashboards, and just use the community ones for inspiration occasionally. I suspect that you might not have the best success with community dashboards since your disks are behind a MegaRAID controller. The dashboard that you're using might be geared more towards consumer-grade setups, i.e. expects disks to be typical The escaped double-quote should be fine (and in fact the original script should have been doing that anyway) - see the string replace in: labels = ','.join(
'{k}="{v}"'.format(k=k, v=v.replace('"', '\\"')) for k, v in metric.labels.items()) |
@dswarbrick ok .. I have to reduce the moving parts here. I decide to use your latest script to collect data, the prom-files look ok to me, scraping/exporting seems to work, only the displaying and querying seems to be a problem right now. I will try to completely get rid of all smartmon-related time series in prometheus now, to remove these problematic entries from that Toshiba-disk. And then I will try to decide which dashboard to use. I had to edit the variables in one of them, that one used a outdated(?) field in the metrics. |
For example this query:
I only get graphs for one out of three servers currently. I will make sure to use the same script everywhere and cleanup things. |
and yes: disks behind that megaraid-hardware gives additional headache |
Perhaps I can offer a little more clarification, as I have used this collector in the past at a previous job, with a mixture of simple HBA (or onboard SATA) setups, as well as MegaRAID controllers. With onboard SATA controllers or HBAs that expose each disk to the OS as individual drives, the However, consider a MegaRAID controller with multiple disks attached, and e.g. a four-disk RAID5 array. The RAID5 virtual drive may be exposed to the OS as e.g. If you had multiple MegaRAID controllers, the If you're using a dashboard that primarily focuses on the |
@dswarbrick thanks a lot, yes, understood. I managed to write a shell script that removes all metrics starting with "smartmon" from prometheus. So I should start from scratch regarding these metrics. Data is scraped, but in Grafana I still get stuff like:
Disabled scraping on that And only getting data from one server (instead of 2 enabled, 1 disabled). fixed by python-library ... Grepping the prom file from your
correct? In my dashboards I see queries with other metrics, I have to get rid of those (although it looks tempting to be able to build on their queries and not having to start from scratch). Sorry for getting slightly offtopic somehow. |
@stefangweichinger Are you writing that regex by hand yourself? If so it does indeed have a missing closing parenthesis. This should work:
You might need to escape the forward-slashes also, since they are delimiters in JavaScript regexes. If on the other hand that regex has been generated by Grafana (e.g. as a result of a multiple-select drop-down list), that smells like a bug. You don't specify what exactly your needs are, but if you are wanting to graph the drive temperature, then yes, that metric is a good starting point. The nature of time-series databases such as Prometheus means that all mistakes are eventually forgotten. If it doesn't bother you too much, just wait for the buggy metrics to age out. It's much less hassle than trying to delete series. |
@dswarbrick no that regex comes from the query in the dashboard, I assume. My needs: basically I'd be happy with some information as given by "megaclisas-status":
I want
The rest would be nice, but not needed. Basically I want some alerts on the temps, done. I did another look at Maybe I run circles here, definitely losing track of which route to follow. |
Generally it would not be advisable to filter by something as specific as drive model number. Probably the only time I've had to do that was when I created some alerting rules for warning about known-buggy firmware versions on particular Intel SSD models. I would recommend removing whatever in that dashboard is adding that label filter, although you may also want to take the issue up with the Grafana support community, because it shouldn't be a problem. The For the controller temperature, you will of course definitely need to use the storcli.py collector, and refer to the |
@dswarbrick Thanks a lot for confirming and explaining. So the basics seem to work now, as far as I see: thanks! Will your script in that branch be changed soon? Any plans? Just asking because I would then have to track that somehow. I use https://grafana.com/grafana/dashboards/10530-s-m-a-r-t-disk-monitoring-for-prometheus-dashboard/ as a base and edit it right now. Looks quite old, some queries use metrics which I don't find in the output of your |
@stefangweichinger There is a PR to merge it into master: #153 I'm trying to do some long-overdue cleanup and modernization of the collectors in this repo. |
What's the problem with it? I recently migrated from smartmon.py to smartctl_exporter because the smartmon.py script doesn't expose SMART attribute ID 11 (Calibration Retry Count). And I later learned that smartmon.py doesn't work with NVMe drives either: #110.
Since that's merged now, does that mean this issue can be closed? |
I fiddle with extracting data out of
smartctl
.A disk gives me this:
The problem is in
There's an escaped quote after "3.5", this leads to issues with the text exporter.
I browsed for various forks and versions of smartmon.sh and smartmon.py.
I currently seem to have the extraction fixed, but I seem to have some problematic data lines /metrics in prometheus.
Sorry for being a bit offtopic: how can I delete these smartmon-related time series without deleting too much? I need to clean this up before I can dig further and check if the scripts work now.
I don't know if having quotes in the SMART values is OK.
The text was updated successfully, but these errors were encountered: