Thanos, Prometheus and Golang version used:
0.37.2
We are doing some tests and a tenant started to ship quite a lot of data. Initially we did not have enough resources for this. We scaled things up afterwards, but it took a long while to get back into a healthy shape.
I think the following screenshot shows the state pretty well. These are the Thanos Receivers running in 'router' / distributor mode:
Now, I get the increase in 409s, because we were bottlenecked, so that is somewhat inevitable. What I cannot quite grasp, however, is the amount of 500s. Basically my question is: where are those coming from?
How we eventually solved this was by simply setting `--receive.forward.async-workers=200`, which feels like a lot compared to the default of 5. What makes me curious, though, is that increasing the number of replicas on the distributor hardly seems to achieve the same result. This is based on nothing hard, but I feel that running 5 replicas with 200 workers each works better than running 20 replicas with 100 workers each, even though the latter should effectively result in more workers.
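For reference, a minimal sketch of how the router side is invoked with that setting; the listen address and hashring file path below are placeholders, not our exact configuration:

```sh
# Router / distributor mode sketch (address and path are placeholders).
thanos receive \
  --remote-write.address=0.0.0.0:19291 \
  --receive.hashrings-file=/etc/thanos/hashrings.json \
  --receive.replication-factor=3 \
  --receive.forward.async-workers=200   # default is 5
```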
The forward delay was really high, but remained "stable" around the ~12-14 minute mark. That is kind of interesting, as I would have expected it to keep increasing over time, but it just seems to cap there. Is there some limit at which there is so much delay that it stops queuing more work and thus returns 500s?
So, obviously, the root cause here is getting into this state in the first place, which isn't a Thanos bug or anything. We merely used this situation to get more experience and battle-test the system. Still, those 500s seem very weird to me, and it also feels like the number of workers is limited by something else (?). I feel there should be a way to scale things such that, regardless of the load, retries, etc., the system recovers faster the more resources I throw at it.
More stats (rough per-receiver math below the list):
~25M active series
~500k series/s
~50k samples/s per distributor
7 receivers
replication factor of 3
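For context, a back-of-the-envelope figure from these numbers, assuming series are spread roughly evenly across the hashring:

```sh
# ~25M active series * replication factor 3, spread over 7 receivers
echo "25000000 * 3 / 7" | bc   # ≈ 10.7M replicated series per receiver
```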
Wanted to add that I sort of figured out what caused it, although I still don't understand why.
Basically, with increased load, latency also starts to grow, even though our receivers have enough CPU and memory headroom. Once this reaches 5s at p99, and maybe even at p95, everything just falls apart and we get completely hammered with retries and more requests.
During that time we did scale up, potentially way too late. The new instance went OOM right from the start, and everything else also became a bit chaotic.
We sort of solved this by stopping the distributors / routers, adding more receivers, and then adding a ton of distributors to handle the load, at which point things started to settle.
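Roughly, the sequence looked like the sketch below, assuming a Kubernetes setup; the workload names and replica counts are placeholders for illustration, not our actual manifests:

```sh
# Hypothetical workload names and replica counts, assuming a Kubernetes setup.
kubectl scale deployment/thanos-receive-router --replicas=0      # stop the routers/distributors
kubectl scale statefulset/thanos-receive-ingestor --replicas=10  # add more receivers first
# ...wait for the new receivers to join the hashring and become ready...
kubectl scale deployment/thanos-receive-router --replicas=20     # then bring back plenty of routers
```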
I think my main question is how this receive latency increase under load can be prevented. Is the only solution to scale more horizontally?