
Causal Model #2

Open · wants to merge 7 commits into main
Conversation

@bjoydeep (Collaborator) commented Jun 11, 2024

Ref: ACM-11079

@saswatamcode @gparvin changes made based on our discussion earlier.

For now you can focus only on changes made to causal-analysis/TowardsCausalThinking.ipynb

Signed-off-by: Joydeep Banerjee <[email protected]>

openshift-ci bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bjoydeep

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bjoydeep (Collaborator, Author):

/hold

@bjoydeep (Collaborator, Author):

A few things:

Found out that a search using DFS would be too naive, but a graph search would work. So I elaborated a bit on Causal Graph analysis vs. the Causal Model. Hope it makes sense.
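
For illustration, here is a minimal sketch of the kind of graph query this enables, assuming the causal graph is held as a networkx DiGraph with edges pointing from cause to effect (the component names and edges below are placeholders, not the exact graphs in the notebook):

```python
import networkx as nx

# Toy causal graph: edges point from an upstream cause to a downstream effect,
# mirroring structural equations like querier-health = f(receiver-health, ...).
g = nx.DiGraph([
    ("receiver", "querier"),
    ("store-gateway", "querier"),
    ("querier", "query-frontend"),
])

# Components currently reporting issues (e.g. firing alerts).
unhealthy = {"querier", "query-frontend"}

# Candidate root causes: unhealthy nodes none of whose upstream causes are unhealthy.
candidates = [n for n in unhealthy if not (nx.ancestors(g, n) & unhealthy)]
print(candidates)  # ['querier'] -> the most upstream unhealthy component
```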

@saswatamcode :

  1. Also created two causal graphs for observability: one based on alerts and another based on metrics. See if it is clear.
  2. For the causal graph based on metrics, some of the metrics for the nodes are obvious, but I have not indicated the metrics for the key nodes like receiver, store, ruler, querier, etc. We need a metric for each. Given the explanation of how the modeling works, could you please take a stab at picking a metric for each of them?

@gparvin

  1. As above, since we do not have alerts created for Policies yet, I just created a Causal Model for GRC.
  2. For the causal graph based on metrics, some of the metrics for the nodes are obvious, but I have not indicated the metrics for the key nodes like the policy propagator, sync controller, etc. We need a metric for each. Given the explanation of how the modeling works, could you please take a stab at picking a metric for each of them?

@saswatamcode (Member):

@bjoydeep thanks! Will take a look!

@saswatamcode (Member) left a comment:

Thanks a lot for writing this up! Sorry for the delay in reviewing!
I have added some suggestions/comments on the source diff for the notebook.

Overall I think it's a good start!

"# Background\n",
"\n",
"1. We collect lots of metrics using the acm-inspector.\n",
"1. We rely on a subset of those metrics to project [acm sizes](https://github.com/stolostron/capacity-planning) for green field capacity planning exercises.\n",
Member:

Tiny format nit: all the bullets seem to be numbered 1.

"## Performance Analysis - on a running ACM \n",
"\n",
"The goals are very different here. It is not about estimating the size of ACM Hub in a greenfield application. Now we have a running ACM hub. A set of managed clusters are connected to it. And a set of ACM features like GRC, Observability are enabled. Questions to answer can be\n",
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
Member:

Kind of important symptom to rely on 🙂

Suggested change
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or firing alerts or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",

"The goals are very different here. It is not about estimating the size of ACM Hub in a greenfield application. Now we have a running ACM hub. A set of managed clusters are connected to it. And a set of ACM features like GRC, Observability are enabled. Questions to answer can be\n",
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
"- I had a problem with my system between time x and y. Can we do some attribution - `can we find out which was service was the real culprit`.\n",
"\n",
Member:

Suggested change
"\n",
"- My system/cluster auto-scaled up to X after which it hit limits. I want to know why.\n",
"\n",

"A `subtle point` :\n",
"1. there are health indicators aka alerts: thanos-compact-halted.\n",
"1. there are metrics which : todo-compaction (is it growing), kube-api-server-latency (is it changing) etc. \n",
"1. there may be hidden metrics (or hard to measure) : is a block corrupted\n",
Member:

I wouldn't say hidden per se, as these metrics are by design discoverable through Prometheus/Thanos interfaces quite well, named as per conventions and usually come with some help text.

I think it is better to say "Hard to measure operations or events"

"Given these Causal Graph, just by using graph queries, we could easily see:\n",
"- if Store Gateway has issues, then user may never be able to see historical data\n",
"- if Observatorium API Gateway has issues, it may effect the both read and write\n",
"- if the querier and queryfront end are both reporting issues, then given that querier is the upstream component, its the most likely candidate\n",
Member:

Usually, this means some StoreAPI component like Receive, Rule, or Store is guilty. Querier errors usually manifest as OOMs (too much data) or no data and sometimes querying bad data (very rare).

Comment on lines +877 to +880
"$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"-->\n",
"$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"\n",
Member:

Retention policy on compactor dictates a lot about Compactor health and the amount of work it has to do.

Suggested change
"$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"-->\n",
"$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"\n",
"$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block,retention-of-raw-blocks)$\n",
"-->\n",
"$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block,retention-of-raw-blocks)$\n",
"\n",

"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{receiver-health} = \\mathbf{f}(time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, can-receiver-reach-obj-store)$\n"
Member:

Health is mostly determined by active time series https://grafana.com/docs/tempo/latest/metrics-generator/active-series/. Also, this is a component with read capability, so it can serve StoreAPI requests from the Querier. And Receive can be configured to retain blocks for a set amount of time, after which it ships them off to objstore. That is also an indicator of health, as longer retention periods mean serving heavy queries alongside ingest. Would be good to factor these in as well.

Suggested change
"$\\mathbf{receiver-health} = \\mathbf{f}(time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, can-receiver-reach-obj-store)$\n"
"$\\mathbf{receiver-health} = \\mathbf{f}(active-time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, number-of-storeapi-calls, can-receiver-reach-obj-store, retention)$\n"

"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-simaltaneous-queries, number-of-querier-replicas, receiver-health, query-cache-health)$\n"
Member:

I would include number of samples per query or time range of query, as that dictates fanout, and number of postings that you have to read from tsdb. Also store-health.

Suggested change
"$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-simaltaneous-queries, number-of-querier-replicas, receiver-health, query-cache-health)$\n"
"$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-samples-per=query, number-of-simultaneous-queries, number-of-querier-replicas, receiver-health, store-health, query-cache-health)$\n"

"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-timeseries, number-of-simaltaneous-queries,store-gw-cache-health)$"
Member:

Hmm, not sure what history configuration is here, maybe retention?
It should include the number of total blocks and the number of store gateway shards. Also, PV size would need to be included as that is where it builds up the block index cache.

Suggested change
"$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-timeseries, number-of-simaltaneous-queries,store-gw-cache-health)$"
"$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-blocks, number-of-storegw-shards, number-of-timeseries-per-storeapi-call, number-of-simultaneous-queries,store-gw-cache-health, storegw-space-on-pv)$"

"output_type": "execute_result"
}
],
"source": [
Member:

I think these graphs might need to change a bit based on comments above

@bjoydeep (Collaborator, Author):

@saswatamcode thank you for the comments. I think they are mostly clear... Will work them in.

Signed-off-by: Joydeep Banerjee <[email protected]>
"Notes:\n",
"\n",
"1. For sake of Green field Capacity planning we do assume that policies or time series is uniform across all managed clusters.\n",
"1. Policy complexity is hard to express numerically\n",

A lot of the complexity is basically the number of resources that need to be considered (due to ranges or namespaces searched) and the number of resources actually managed by the policy.

"1. Kube API Server on the Hub\n",
"1. Kube API Server on the managed cluster\n",
"\n",
"However, we must consider the health in lenses of what a user of the system sees. The user sees the effect when they try to `Create a Policy` or during/for `Policy Status Updates`.\n",

It's really the creation (or update) of a replicated policy. If there are no placement details, a created policy is basically disabled. As managed clusters are added to or removed from the placement, this replicated policy creation happens.
