feat: limit scrape_duration_seconds visualization to CHT instance #73

IanMinash · 2023-07-16T17:02:57Z

Limited the values of the Scrape Duration graph to only come from the currently selected CHT instance($cht_instance). This prevents non-related VMs from showing up in a dashboard thats only focused on CHT monitoring.

jkuester

Thanks for the PR @IanMinash! I agree that it would be helpful to limit the scrape metrics to the current CHT instance in context. Unfortunately, your change would hide not only the metrics for other CHT instances, but also the scrape metrics from the rest of the monitoring stack (exporters, Grafana, etc) and, most importantly, would hide Postgres scrape metrics (for anyone who has connected Watchdog to their Postgres instance).

I have suggested a change that would keep the other non-CHT metrics in view while not displaying metrics for the out-of-context CHT instances. Additionally, it will filter the postgres metrics to only show the one associated with the current CHT instance in context.

Can you try these changes and see if they are acceptable? Thanks!

jkuester · 2023-07-18T13:49:52Z

grafana/provisioning/dashboards/CHT/cht_admin_overview.json

@@ -2052,7 +2052,7 @@
                "uid": "PBFA97CFB590B2093"
              },
              "editorMode": "code",
-              "expr": "scrape_duration_seconds",
+              "expr": "scrape_duration_seconds{instance=\"$cht_instance\"}",


Suggested change

"expr": "scrape_duration_seconds{instance=\"$cht_instance\"}",

"expr": "scrape_duration_seconds unless (scrape_duration_seconds{job=\"cht\", instance!=\"$cht_instance\"} or scrape_duration_seconds{job=\"postgres\", cht_instance!=\"$cht_instance\"})",

Hi @jkuester, in our case, the monitoring stack is used beyond CHT and the changes suggested still include targets unrelated to CHT. It is also not quite clear to me why I should be interested in scrape duration for grafana for example, whilst I'm interested in processes/metrics that directly relate to CHT.

In my opinion, we should be explicit with the jobs we intend to target then exclude out-of-context instances though I'm still keen on understanding your POV.

Hmm okay, I think I see where you are coming from here. This whole Monitoring Stack row was originally intended to be just an at-a-glance health-check for the Watchdog processes themselves (so Grafana, Prometheus, and the exporters). That being said, I understand the value in being able to easily check the scrape history for the CHT instance in context. It is almost like we are technically trying to achieve different goals:

See overall health of Prometheus by seeing the historical trends for the for the scrape times of all targets. This lets you easily identify outliers or understand when a target is not getting scraped as expected.

See the historical scrape performance for the particular CHT instance in context (informing, among other things, the freshness of the metrics for that instance).

Here is my proposal: lets have separate panels to address both of there goals! In this PR, we can leave the panel in the Monitoring Stack row as it was and instead add a new panel on the CHT Admin Details dashboard (maybe up at the top outside of all the existing rows?) that just shows the scrape data for the CHT instance in context (along with the Postgres instance for that instance if it exists). Then, I have logged #78 which can be addressed later to move the whole Monitoring Stack row out of the CHT Admin Overview dashboard.

Let me know what you think!

I agree that the Monitoring Stack row should be moved out of the CHT Admin dashboard.

Limit scrape_duration_seconds viz to CHT instance

0b02a92

jkuester self-requested a review July 17, 2023 17:06

jkuester requested changes Jul 18, 2023

View reviewed changes

jkuester changed the title ~~Limit scrape_duration_seconds visualization to CHT instance~~ feat: limit scrape_duration_seconds visualization to CHT instance Jul 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: limit scrape_duration_seconds visualization to CHT instance #73

feat: limit scrape_duration_seconds visualization to CHT instance #73

IanMinash commented Jul 16, 2023

jkuester left a comment

jkuester Jul 18, 2023

IanMinash Jul 19, 2023

jkuester Jul 20, 2023

IanMinash Jul 21, 2023

	"expr": "scrape_duration_seconds{instance=\"$cht_instance\"}",
	"expr": "scrape_duration_seconds unless (scrape_duration_seconds{job=\"cht\", instance!=\"$cht_instance\"} or scrape_duration_seconds{job=\"postgres\", cht_instance!=\"$cht_instance\"})",

feat: limit scrape_duration_seconds visualization to CHT instance #73

Are you sure you want to change the base?

feat: limit scrape_duration_seconds visualization to CHT instance #73

Conversation

IanMinash commented Jul 16, 2023

jkuester left a comment

Choose a reason for hiding this comment

jkuester Jul 18, 2023

Choose a reason for hiding this comment

IanMinash Jul 19, 2023

Choose a reason for hiding this comment

jkuester Jul 20, 2023

Choose a reason for hiding this comment

IanMinash Jul 21, 2023

Choose a reason for hiding this comment