
[BBS] Add request metrics for BBS endpoints #898

Open
klapkov opened this issue Jan 24, 2024 · 7 comments
Comments

@klapkov
Contributor

klapkov commented Jan 24, 2024

Add request metrics for BBS endpoints

Summary

Currently, BBS does not emit much information about the performance of its endpoints. What we emit today (with regard to BBS endpoints) is RequestsCount and RequestLatency, and those cover all endpoints in aggregate. That is why we propose a more detailed look under the hood of the BBS server. We can achieve this with the help of a module already used in the rep and in locket:
https://github.com/cloudfoundry/locket/blob/main/metrics/helpers/request_metrics.go

With this helper we get much more information on performance per endpoint. It can be implemented at the handler level and emits new metrics once per minute (the default report interval). It gives us these metrics:

  • RequestsStarted
  • RequestsSucceeded
  • RequestsFailed
  • RequestsInFlight
  • RequestsCancelled
  • RequestLatencyMax

Now the tricky question is which endpoints should implement it. Here are most of the BBS endpoints:

//desiredLRP endpoints
"DesiredLRPSchedulingInfos", "DesiredLRPRoutingInfos", "DesiredLRPByProcessGuid", "DesiredLRPs",

//desiredLRP lifecycle endpoints
"UpdateDesiredLRP", "RemoveDesiredLRP", "DesireDesiredLRP",

//actualLRP endpoints
"ActualLRPs", 

// actualLRP lifecycle endpoints
"ClaimActualLRP", "StartActualLRP", "CrashActualLRP", "FailActualLRP", "RemoveActualLRP", "RetireActualLRP",

// evacuation endpoints
"RemoveEvacuatingActualLRP", "EvacuateClaimedActualLRP", "EvacuateCrashedActualLRP", "EvacuateStoppedActualLRP", "EvacuateRunningActualLRP",

// task endpoints
"Tasks", "TaskByGuid", "DesireTask", "StartTask", "CancelTask", "RejectTask", "CompleteTask", "ResolvingTask", "DeleteTask",

Let's say we implement the helper for every one of these endpoints, which would give us complete visibility into all operations of the BBS server. We have 28 endpoints here; multiplied by 6 metrics each, that is 168 new metrics. That is a lot.

If we do not want to introduce this many new metrics, we can try to divide them into groups. Those groups could be, for example:

  • "DesiredLRPEndpoints"
  • "DesiredLRPLifecycleEndpoints"
  • "ActualLRPEndpoints"
  • "ActualLRPLifecycleEndpoints"
  • "EvacuationEndpoints"
  • "TaskEndpoints"

Maybe StartActualLRP should be in a group of its own, since it is called periodically.

In this case we have 36 new metrics (6 groups × 6 metrics). With this approach we do not get quite as much information, but at least we know how a given operation group performs. The groups above are only an example; if we go down this path, we should decide how to split them.

Maybe we can even make the set of endpoints that implement the helper configurable, so that everyone can use what best suits them.
Nevertheless, I think this topic is worth a discussion. I will come back with a PoC of sorts in the next few days.

Diego repo

https://github.com/cloudfoundry/bbs

@geofffranks
Contributor

I'm in favor of adding better metrics for more visibility, but I also want to keep the new metric count low. Would your goals be met by keeping the current two metrics but expanding them out into groups of endpoints? I'm just wondering whether having the started vs. succeeded/failed/in-flight/cancelled request counts would be worth the additional metrics over just a raw request count.

For grouping - maybe it makes sense to do a small-scale profile with metrics on each endpoint, and then think about grouping the less-frequent/less-impactful calls while keeping the most-used endpoints, and those most susceptible to performance issues, isolated.

@klapkov can you add an item to the next WG meeting's agenda to get broader opinions on this?

@MarcPaquette
Contributor

Hi @geofffranks @klapkov ,
Did the discussions for this issue occur during the WG meeting? What's the status of this PR (and cloudfoundry/bbs#80 too)?

@klapkov
Contributor Author

klapkov commented May 2, 2024

Hello @MarcPaquette,

I was not able to join last month's WG meeting, but we will include the topic in the agenda for the upcoming one next week. We will discuss the topic there and hopefully get some input from everyone.

@geofffranks
Contributor

@klapkov did this ever get discussed in the WG?

@klapkov
Contributor Author

klapkov commented Jun 27, 2024

@geofffranks Sadly no. This is an important topic for us, but we had a priority switch within the team, and it seems that, for now, we won't invest time in it. But we will certainly return to it at some point.

@geofffranks
Contributor

@klapkov Any idea when you'll want to reprioritize this? Should we close this issue + PR out until it can be reprioritized and is under development again, or do you want to keep them open?

@klapkov
Contributor Author

klapkov commented Oct 21, 2024

@geofffranks Again, sorry for the long delay on this. We have it in the current backlog and someone will probably pick up this task soon, so it would be nice to keep them open for a little while longer.

Projects
Status: Pending Review | Discussion
Development

No branches or pull requests

3 participants