Causal Model #2
base: main
Conversation
Signed-off-by: Joydeep Banerjee <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bjoydeep. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.

/hold
A few things: I found that a plain DFS search would be too naive, but a proper graph search works. So I elaborated a bit on Causal Graph analysis vs. the Causal Model. Hope it makes sense.

@bjoydeep thanks! Will take a look!
Thanks a lot for writing this up! Sorry for the delay in reviewing!
I have added some suggestions/comments on the source diff for the notebook.
Overall I think it's a good start!
"# Background\n",
"\n",
"1. We collect lots of metrics using the acm-inspector.\n",
"1. We rely on a subset of those metrics to project [acm sizes](https://github.com/stolostron/capacity-planning) for green field capacity planning exercises.\n",
Tiny format nit: all the bullets seem to be numbered 1.
"## Performance Analysis - on a running ACM \n",
"\n",
"The goals are very different here. It is not about estimating the size of ACM Hub in a greenfield application. Now we have a running ACM hub. A set of managed clusters are connected to it. And a set of ACM features like GRC, Observability are enabled. Questions to answer can be\n",
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
Kind of important symptom to rely on 🙂
Suggested change:
- "- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
+ "- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or firing alerts or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
"The goals are very different here. It is not about estimating the size of ACM Hub in a greenfield application. Now we have a running ACM hub. A set of managed clusters are connected to it. And a set of ACM features like GRC, Observability are enabled. Questions to answer can be\n",
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
"- I had a problem with my system between time x and y. Can we do some attribution - `can we find out which was service was the real culprit`.\n",
"\n",
Suggested change:
- "\n",
+ "\n",
+ "- My system/cluster auto-scaled up to X after which it hit limits. I want to know why.\n",
+ "\n",
"A `subtle point` :\n",
"1. there are health indicators aka alerts: thanos-compact-halted.\n",
"1. there are metrics which : todo-compaction (is it growing), kube-api-server-latency (is it changing) etc. \n",
"1. there may be hidden metrics (or hard to measure) : is a block corrupted\n",
I wouldn't say hidden per se, as these metrics are by design quite discoverable through Prometheus/Thanos interfaces, named per conventions, and usually come with some help text.
I think it is better to say "hard-to-measure operations or events".
"Given these Causal Graph, just by using graph queries, we could easily see:\n",
"- if Store Gateway has issues, then user may never be able to see historical data\n",
"- if Observatorium API Gateway has issues, it may effect the both read and write\n",
"- if the querier and queryfront end are both reporting issues, then given that querier is the upstream component, its the most likely candidate\n",
Usually this means some StoreAPI component like Receive, Rule, or Store is the culprit. Querier errors usually manifest as OOMs (too much data) or no data, and sometimes querying bad data (very rare).
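The graph queries described above can be sketched with a tiny reachability search. This is a minimal illustration only: the component names and edge set below are simplified assumptions, not the notebook's actual graph.

```python
from collections import deque

# Edges point from cause to effect: an upstream issue propagates downstream.
# Illustrative edge set, not the notebook's actual causal graph.
EDGES = {
    "store-gateway": ["querier"],
    "receiver": ["querier", "observatorium-api"],
    "querier": ["query-frontend"],
    "query-frontend": ["observatorium-api"],
}

def downstream(node):
    """Return every component reachable from `node`,
    i.e. potentially affected when `node` has issues."""
    seen, todo = set(), deque([node])
    while todo:
        for nxt in EDGES.get(todo.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

# If the Store Gateway has issues, historical-data reads may be affected:
print(sorted(downstream("store-gateway")))
# -> ['observatorium-api', 'querier', 'query-frontend']
```

The same traversal run on a reversed edge map would answer the attribution question ("both querier and query-frontend report issues, who is the shared upstream cause?").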
"$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"-->\n",
"$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"\n",
Retention policy on compactor dictates a lot about Compactor health and the amount of work it has to do.
Suggested change:
- "$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
- "-->\n",
- "$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
- "\n",
+ "$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block,retention-of-raw-blocks)$\n",
+ "-->\n",
+ "$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block,retention-of-raw-blocks)$\n",
+ "\n",
"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{receiver-health} = \\mathbf{f}(time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, can-receiver-reach-obj-store)$\n"
Health is mostly determined by active time series https://grafana.com/docs/tempo/latest/metrics-generator/active-series/. Also, this is a component with read capability, so it can serve StoreAPI requests from the Querier. And Receive can be configured to retain blocks for a set amount of time, after which it ships them off to objstore. That is also an indicator of health, as longer retention periods mean serving heavy queries alongside ingest. Would be good to factor these in as well.
Suggested change:
- "$\\mathbf{receiver-health} = \\mathbf{f}(time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, can-receiver-reach-obj-store)$\n"
+ "$\\mathbf{receiver-health} = \\mathbf{f}(active-time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, number-of-storeapi-calls, can-receiver-reach-obj-store, retention)$\n"
"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-simaltaneous-queries, number-of-querier-replicas, receiver-health, query-cache-health)$\n"
I would include number of samples per query or the time range of the query, as that dictates fanout and the number of postings you have to read from TSDB. Also store-health.
Suggested change:
- "$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-simaltaneous-queries, number-of-querier-replicas, receiver-health, query-cache-health)$\n"
+ "$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-samples-per-query, number-of-simultaneous-queries, number-of-querier-replicas, receiver-health, store-health, query-cache-health)$\n"
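The f(...) form composes: querier-health takes the health of child nodes such as the receiver as inputs. A hedged sketch where the per-replica series limit and concurrency threshold are made-up numbers, not Thanos defaults:

```python
# Illustrative composition of health nodes; all thresholds are assumptions.
def receiver_health(active_series: int, replicas: int) -> bool:
    # Assumed per-replica active-series limit of 1M.
    return active_series / max(replicas, 1) < 1_000_000

def querier_health(simultaneous_queries: int, receiver_ok: bool,
                   store_ok: bool) -> bool:
    # querier-health inherits failures from its upstream (parent) nodes.
    return receiver_ok and store_ok and simultaneous_queries < 50

print(querier_health(10, receiver_health(2_000_000, 4), store_ok=True))
# -> True
```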
"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-timeseries, number-of-simaltaneous-queries,store-gw-cache-health)$"
Hmm, not sure what history configuration is here; maybe retention?
Should include the number of total blocks and the number of store gateway shards. Also, PV size would need to be included, as that is where it builds up the block index cache.
Suggested change:
- "$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-timeseries, number-of-simaltaneous-queries,store-gw-cache-health)$"
+ "$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-blocks, number-of-storegw-shards, number-of-timeseries-per-storeapi-call, number-of-simultaneous-queries, store-gw-cache-health, storegw-space-on-pv)$"
"output_type": "execute_result"
}
],
"source": [
I think these graphs might need to change a bit based on comments above
@saswatamcode thank you for the comments. I think they are mostly clear... Will work them in. |
Signed-off-by: Joydeep Banerjee <[email protected]>
"Notes:\n",
"\n",
"1. For sake of Green field Capacity planning we do assume that policies or time series is uniform across all managed clusters.\n",
"1. Policy complexity is hard to express numerically\n",
A lot of the complexity is basically the number of resources that need to be considered (due to ranges or namespaces searched) and the number of resources actually managed by the policy.
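Following that observation, policy "complexity" could be approximated numerically from resources scanned and resources managed. The heuristic below (and its example inputs) is purely an assumption for illustration, not anything the notebook defines:

```python
# Assumed heuristic: complexity ~ resources that must be considered
# (namespaces searched x resources per namespace) plus resources
# actually managed by the policy.
def policy_complexity(namespaces_searched: int,
                      resources_per_namespace: int,
                      resources_managed: int) -> int:
    return namespaces_searched * resources_per_namespace + resources_managed

print(policy_complexity(20, 150, 40))  # -> 3040
```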
"1. Kube API Server on the Hub\n",
"1. Kube API Server on the managed cluster\n",
"\n",
"However, we must consider the health in lenses of what a user of the system sees. The user sees the effect when they try to `Create a Policy` or during/for `Policy Status Updates`.\n",
It's really the creation (or update) of a replicated policy. If there are no placement details, a created policy is basically disabled. As managed clusters are added to or removed from the placement, this replicated-policy creation happens.
Ref: ACM-11079
@saswatamcode @gparvin changes made based on our discussion earlier.
For now you can focus only on changes made to causal-analysis/TowardsCausalThinking.ipynb