"per-user series limit of 150000 exceeded" on Monolith VM Mimir but active time series per tenant is far lower #12619
Hey everyone, I am puzzled. I recently rolled out Mimir and Alloy, with Mimir running as a monolith on a VM. Simple as (the ingester replication factor is also set to 1, because I hit an issue when trying to install Mimir with the default of 3; I don't remember the error though). Alloy is deployed into our cluster. Everything works fine.

But after a bit, I got the error "per-user series limit of 150000 exceeded" in Alloy's logs, also reflected in Mimir's logs. "Strange, but we do have hundreds of pods." I ran mimirtool analyze and stripped some unused metrics from Alloy, but there is only so much I could drop.

BUT, and here is the kicker: from everything I could find, the active series in the ingester were never near 150k. Now, I am sure that "whatever I could find" is not everything there is to find, and some of my assumptions are undoubtedly incorrect. But I would like to understand why I was seeing that error (eventually I just cranked the limit) and why I still have to set it far higher than I would assume I need to.

Before I increased the limit, but after the failures started, the active series never seemed to breach 60k. We only have the one tenant atm.

During:
curl localhost:9009/metrics | grep active_series
# HELP cortex_ingester_active_series Number of currently active series per user.
# TYPE cortex_ingester_active_series gauge
cortex_ingester_active_series{user="dxp-internal"} 50733

After I increased the limit and dropped a bunch of metrics, it has now stabilized at ~140k:
# HELP cortex_ingester_active_series Number of currently active series per user.
# TYPE cortex_ingester_active_series gauge
cortex_ingester_active_series{user="dxp-internal"} 138666

So, it is my understanding that the 150k limit should now suffice, and yet as soon as I lower it back down to 150k the error comes back. In fact, anything below 250k throws the same error. I am at a loss now. If someone could explain what I am doing wrong, or where I should be looking to understand what metric volume I am actually ingesting, that would be great.
Replies: 2 comments
The limit applies to in-memory series, which are different from "active series". The limit is there to protect ingesters, and in-memory series is what matters in that respect. Active series is a Grafana Cloud billing concept (series that have had at least one sample in the last 20 minutes) and isn't really relevant to protecting ingesters. The metric you want to look at is cortex_ingester_memory_series. Active series and in-memory series can differ when you have a lot of churn in metrics, such as scraped pods restarting often and getting a new hostname every time.

Ah, perfect, that is the one. Now it makes sense. Thanks!
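For anyone landing here with the same symptom, here is a minimal Python sketch for comparing the two gauges from a single /metrics scrape. It is not an official tool. The function names and the `idle_in_memory` field are made up for illustration. It sums every sample of a metric regardless of labels, which assumes a single tenant and a single ingester as in the setup above, and it assumes the configured limit is compared directly against the in-memory count (replication factor 1).

```python
import re

# One sample line of the Prometheus text exposition format:
# metric name, optional {labels}, then the value.
# Simplification: label values containing "}" are not handled.
SAMPLE_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def sum_metric(metrics_text: str, name: str) -> float:
    """Sum the values of every sample of `name`, across all label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = SAMPLE_LINE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(2))
    return total


def series_headroom(metrics_text: str, limit: int) -> dict:
    """Compare active vs in-memory series against a per-user series limit.

    Assumption: one tenant, one ingester, so the summed gauges can be
    compared to the limit directly.
    """
    active = sum_metric(metrics_text, "cortex_ingester_active_series")
    in_memory = sum_metric(metrics_text, "cortex_ingester_memory_series")
    return {
        "active": active,
        "in_memory": in_memory,
        # Series still held in memory that have no recent samples (churn).
        "idle_in_memory": in_memory - active,
        "over_limit": in_memory > limit,
    }
```

Feed it the body returned by curl localhost:9009/metrics, together with the limit you have configured. A large `idle_in_memory` value points at churn, which is the case described in the answer above.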