
💡 [Feature] monitoring: Add a second Prometheus to scrape the first Prometheus to keep stats long term at a lower granularity #277

Closed
tlvu opened this issue Jan 17, 2023 · 8 comments · Fixed by #461
Assignees
Labels
enhancement New feature or request

Comments

@tlvu
Collaborator

tlvu commented Jan 17, 2023

Description

Currently, the ./components/monitoring component scrapes every 5 minutes and keeps the stats for 90 days.

If we want longer-term stats for high-level trend tracking (e.g. 5 years), we could use a second Prometheus to scrape the first one daily (averaging the values), so we can keep stats much longer without consuming too much disk space.

This second Prometheus should be on a different machine than the PAVICS host, so that when PAVICS is down we can still access those long-term stats.

We should also explore the federation feature of Prometheus, to see whether it is simpler to implement/deploy or gives better long-term stats.
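If federation is explored, the second instance would pull selected series from the first via its `/federate` endpoint. A minimal sketch of the scrape config on the second instance (hostname and match filter are hypothetical; note that Prometheus marks samples stale after a few minutes, so a literal daily scrape interval would likely need recording rules on the first instance rather than raw federation):

```yaml
# prometheus.yml on the second (long-term) instance -- illustrative only
scrape_configs:
  - job_name: 'federate-pavics'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'                   # pull everything; narrow in practice
    static_configs:
      - targets: ['pavics-host:9090']     # hypothetical address of the first Prometheus
```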

@tlvu tlvu added the enhancement New feature or request label Jan 17, 2023
@tlvu tlvu self-assigned this Jan 17, 2023
@huard huard pinned this issue Apr 19, 2023
@huard
Collaborator

huard commented Apr 5, 2024

I suspect that it is possible to simply add new rules that aggregate metrics at a lower frequency. That is, I don't think we need a second instance.

https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
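For example, a recording rule group evaluated at a low frequency could precompute daily aggregates in place (the group name and recorded metric here are illustrative):

```yaml
groups:
  - name: daily-aggregates
    interval: 24h    # evaluate once a day instead of at every rule-evaluation interval
    rules:
      - record: instance:node_memory_available_bytes:avg1d
        expr: avg_over_time(node_memory_MemAvailable_bytes[1d])
```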

@fmigneault
Member

I would also prefer a lower-frequency logging approach over having a duplicate instance.

@tlvu
Collaborator Author

tlvu commented Apr 8, 2024

The reason I suggested a second Prometheus is that the retention policy is instance-wide and not per metric.

However, that was a while ago. Maybe newer versions of Prometheus allow per-metric data retention; worth exploring.

So yes, we can lower the polling frequency, but if we cannot increase the retention duration, then we still do not have long-term stats, which is our ultimate goal.
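For reference, retention is a single instance-wide setting passed as a startup flag; there is no per-metric equivalent. A docker-compose sketch (not the actual compose file) of where it lives:

```yaml
# docker-compose fragment (illustrative): one retention value for the whole TSDB
services:
  prometheus:
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=90d'   # applies to every metric in this instance
```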

Also, I suggested the second Prometheus instance be on a different machine, so it's not really "duplicated" because it does not have the same role:

  • it can still provide stats even if the real PAVICS host is down (hardware failure, data corruption, ...)
  • it can aggregate all other PAVICS hosts (staging, tests, ...)

@fmigneault
Member

Ok. If it is a limitation of Prometheus, then let's try with a second one.

@huard
Collaborator

huard commented Apr 10, 2024

Post on solutions to this problem, which in the jargon seems to be known as "downsampling":
https://last9.io/blog/downsampling-aggregating-metrics-in-prometheus-practical-strategies-to-manage-cardinality-and-query-performance/

@mishaschwartz
Collaborator

Instead of a second prometheus scraping the first, we could also use one of the other technologies they recommend for longterm storage (https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage). Both Thanos and M3 seem to be recommended elsewhere.

We could create recording rules for the metrics we care about storing in prometheus and then use an external tool to store those specific metrics over a longer term and query just those metrics over larger time-scales as needed.

@mishaschwartz
Collaborator

I've been playing around with Thanos. One issue with Thanos is that it stores all metrics from a prometheus instance, which can end up being a very large amount of data (even if the data is compacted somewhat: https://thanos.io/tip/thanos/quick-tutorial.md/#compactor).

As discussed here, there are potential ways to store only the metrics that we care about, which would reduce the additional disk space needed to store the data.

It looks like we're likely going to need to introduce a second prometheus instance even if we also use Thanos, so that we can select which metrics we store long term.

mishaschwartz added a commit that referenced this issue Feb 24, 2025
(#461)

## Overview

The `prometheus-longterm-metrics` component collects longterm monitoring
metrics from the original prometheus instance (the one created by the
``components/monitoring`` component).

Longterm metrics are any prometheus rules that have the label ``group:
longterm-metrics``, or in other words, are selectable using prometheus's
``'{group="longterm-metrics"}'`` query filter. To see which longterm
metric rules are added by default, see the
``optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template``
file.
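A rule qualifying as a longterm metric might look roughly like this (the recorded metric name and expression are hypothetical; the point is the ``group: longterm-metrics`` label that makes it selectable):

```yaml
groups:
  - name: longterm-metrics
    rules:
      - record: instance:filesystem_used:percent    # hypothetical metric name
        expr: 100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)
        labels:
          group: longterm-metrics   # selected by {group="longterm-metrics"}
```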

To configure this component:

* update the ``PROMETHEUS_LONGTERM_RETENTION_TIME`` variable to set how
long the data will be kept by prometheus
* update the ``PROMETHEUS_LONGTERM_STORE_INTERVAL`` variable to set how
often the longterm metrics rules will be calculated. For example,
setting it to ``10h`` will calculate these metrics every 10 hours.
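Putting the two settings together, a hypothetical ``env.local`` fragment might read (values are illustrative, not defaults):

```
# env.local (illustrative values)
PROMETHEUS_LONGTERM_RETENTION_TIME="5y"    # keep longterm metrics for 5 years
PROMETHEUS_LONGTERM_STORE_INTERVAL="24h"   # evaluate longterm rules once a day
```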

Enabling the `prometheus-longterm-metrics` component creates the
additional endpoint ``/prometheus-longterm-metrics``.

The `thanos` component enables better storage of longterm metrics
collected by the ``optional-components/prometheus-longterm-metrics``
component. Data will be collected from the
``prometheus-longterm-metrics`` and stored in an S3 object store
indefinitely.

When enabling this component, please change the default values for the
``MINIO_ROOT_USER`` and ``MINIO_ROOT_PASSWORD`` by updating the
``env.local`` file. These set the login credentials for the root user
that runs the [minio](https://min.io/) object store.
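For context, Thanos reads its S3 settings from an object-store config file; against a minio backend it would look roughly like this (the bucket name and endpoint are hypothetical, and the credentials are the minio root user mentioned above):

```yaml
# objstore.yml (illustrative)
type: S3
config:
  bucket: thanos-metrics          # hypothetical bucket name
  endpoint: minio:9000            # hypothetical in-network minio address
  access_key: <MINIO_ROOT_USER>
  secret_key: <MINIO_ROOT_PASSWORD>
  insecure: true                  # plain HTTP inside the docker network
```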

Enabling the `thanos` component creates the additional endpoints:

* ``/thanos-query``: a prometheus-like query interface to inspect the
data stored by thanos
* ``/thanos-minio``: a minio web console to inspect the data stored by
minio.

This also includes an update of the prometheus version from `v2.19.0` to
the current latest, `v2.52.0`. This is required to support the
interaction between prometheus and thanos.

## Changes

**Non-breaking changes**
- New component version: prometheus:v2.52.0

## Related Issue / Discussion

- Resolves #277
- Add some initial metrics as described in #447, but we should really add
more (either in this PR or a future one) by adding more rules to the
`birdhouse/optional-components/prometheus-longterm-metrics/config/monitoring/prometheus.rules.template`
file.

## Additional Information

- I tested upgrading the prometheus version and there were no issues (no
loss of data, no changed APIs, etc.)
- Note that the thanos setup is pretty minimal but probably good enough
for our purposes. We can always add more of the thanos
features/components in the future if needed.

## CI Operations

<!--
The test suite can be run using a different DACCS config with
``birdhouse_daccs_configs_branch: branch_name`` in the PR description.
To globally skip the test suite regardless of the commit message use
``birdhouse_skip_ci`` set to ``true`` in the PR description.
Note that using ``[skip ci]``, ``[ci skip]`` or ``[no ci]`` in the
commit message will override ``birdhouse_skip_ci`` from the PR
description.
-->

birdhouse_daccs_configs_branch: master
birdhouse_skip_ci: false