
Causal Model #2

Open · wants to merge 7 commits into main
Conversation

@bjoydeep (Collaborator) commented Jun 11, 2024

Ref: ACM-11079

@saswatamcode @gparvin changes made based on our discussion earlier.

For now you can focus only on changes made to causal-analysis/TowardsCausalThinking.ipynb

Signed-off-by: Joydeep Banerjee <[email protected]>

openshift-ci bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bjoydeep

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@bjoydeep (Collaborator, Author):

/hold

@bjoydeep (Collaborator, Author):

A few things:

Found out that a search using DFS would be too naive, but a graph search would work. So I elaborated a bit on Causal Graph analysis vs. the Causal Model. Hope it makes sense.
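
For illustration, here is a minimal sketch of the kind of graph query this enables, assuming the causal graph is held as a networkx DiGraph with edges pointing from cause to effect (the component names and edges below are placeholders, not the exact graphs in the notebook):

```python
import networkx as nx

# Toy causal graph: edges point from an upstream cause to a downstream effect,
# mirroring structural equations like querier-health = f(receiver-health, ...).
g = nx.DiGraph([
    ("receiver", "querier"),
    ("store-gateway", "querier"),
    ("querier", "query-frontend"),
])

# Components currently reporting issues (e.g. firing alerts).
unhealthy = {"querier", "query-frontend"}

# Candidate root causes: unhealthy nodes none of whose upstream causes are unhealthy.
candidates = [n for n in unhealthy if not (nx.ancestors(g, n) & unhealthy)]
print(candidates)  # ['querier'] -> the most upstream unhealthy component
```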

@saswatamcode :

  1. Also created two causal graphs for observability: one based on alerts and another based on metrics. See if it is clear.
  2. For the causal graph based on metrics, some of the metrics for the nodes are obvious, but I have not indicated the metrics for the key nodes like receiver, store, ruler, querier, etc. We need a metric for each. Given the explanation of how the modeling works, could you please take a stab at picking a metric for each of them?

@gparvin

  1. As above, since we do not have alerts created for Policies yet, I just created a Causal Model for GRC.
  2. For the causal graph based on metrics, some of the metrics for the nodes are obvious, but I have not indicated the metrics for the key nodes like the policy propagator, sync controller, etc. We need a metric for each. Given the explanation of how the modeling works, could you please take a stab at picking a metric for each of them?

@saswatamcode (Member):

@bjoydeep thanks! Will take a look!

@saswatamcode (Member) left a comment:

Thanks a lot for writing this up! Sorry for the delay in reviewing!
I have added some suggestions/comments on the source diff for the notebook.

Overall I think it's a good start!

"# Background\n",
"\n",
"1. We collect lots of metrics using the acm-inspector.\n",
"1. We rely on a subset of those metrics to project [acm sizes](https://github.com/stolostron/capacity-planning) for green field capacity planning exercises.\n",
Member:

Tiny format nit: all the bullets seem to be numbered 1.

"## Performance Analysis - on a running ACM \n",
"\n",
"The goals are very different here. It is not about estimating the size of ACM Hub in a greenfield application. Now we have a running ACM hub. A set of managed clusters are connected to it. And a set of ACM features like GRC, Observability are enabled. Questions to answer can be\n",
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
Member:

Kind of important symptom to rely on 🙂

Suggested change
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or firing alerts or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",

"The goals are very different here. It is not about estimating the size of ACM Hub in a greenfield application. Now we have a running ACM hub. A set of managed clusters are connected to it. And a set of ACM features like GRC, Observability are enabled. Questions to answer can be\n",
"- Is my system approaching some kind of limit which may not manifest itself simply in CPU, Memory consumption. It may be responding slower, it may be dropping data or there could be other manifestations. The question `do we need to add more capacity` can be approached now with much more details. We never had the luxury of those details when we sized for green field.\n",
"- I had a problem with my system between time x and y. Can we do some attribution - `can we find out which was service was the real culprit`.\n",
"\n",
Member:

Suggested change
"\n",
"- My system/cluster auto-scaled up to X after which it hit limits. I want to know why.\n",
"\n",

"A `subtle point` :\n",
"1. there are health indicators aka alerts: thanos-compact-halted.\n",
"1. there are metrics which : todo-compaction (is it growing), kube-api-server-latency (is it changing) etc. \n",
"1. there may be hidden metrics (or hard to measure) : is a block corrupted\n",
Member:

I wouldn't say hidden per se, as these metrics are by design discoverable through Prometheus/Thanos interfaces quite well, named as per conventions and usually come with some help text.

I think it is better to say "Hard to measure operations or events"

"Given these Causal Graph, just by using graph queries, we could easily see:\n",
"- if Store Gateway has issues, then user may never be able to see historical data\n",
"- if Observatorium API Gateway has issues, it may effect the both read and write\n",
"- if the querier and queryfront end are both reporting issues, then given that querier is the upstream component, its the most likely candidate\n",
Member:

Usually, this means some StoreAPI component like Receive, Rule, or Store is guilty. Querier errors usually manifest as OOMs (too much data) or no data and sometimes querying bad data (very rare).

Comment on lines +877 to +880
"$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"-->\n",
"$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"\n",
Member:

Retention policy on compactor dictates a lot about Compactor health and the amount of work it has to do.

Suggested change
"$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"-->\n",
"$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block)$\n",
"\n",
"$compactor-health = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block,retention-of-raw-blocks)$\n",
"-->\n",
"$\\mathbf{compactor-health} = \\mathbf{f}(is-compactor-out-ofspace,is-there-a-bad-block,retention-of-raw-blocks)$\n",
"\n",

"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{receiver-health} = \\mathbf{f}(time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, can-receiver-reach-obj-store)$\n"
Member:

Health is mostly determined by active time series https://grafana.com/docs/tempo/latest/metrics-generator/active-series/. Also, this is a component with read capability, so it can serve StoreAPI requests from the Querier. And Receive can be configured to retain blocks for a set amount of time, after which it ships them off to objstore. That is also an indicator of health, as longer retention periods mean serving heavy queries alongside ingest. Would be good to factor these in as well.

Suggested change
"$\\mathbf{receiver-health} = \\mathbf{f}(time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, can-receiver-reach-obj-store)$\n"
"$\\mathbf{receiver-health} = \\mathbf{f}(active-time-series, number-of-receiver-replicas, is-receiver-pv-out-of-space, number-of-storeapi-calls, can-receiver-reach-obj-store, retention)$\n"

"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-simaltaneous-queries, number-of-querier-replicas, receiver-health, query-cache-health)$\n"
Member:

I would include number of samples per query or time range of query, as that dictates fanout, and number of postings that you have to read from tsdb. Also store-health.

Suggested change
"$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-simaltaneous-queries, number-of-querier-replicas, receiver-health, query-cache-health)$\n"
"$\\mathbf{querier-health} = \\mathbf{f}(number-of-time-series-per-query, number-of-samples-per=query, number-of-simultaneous-queries, number-of-querier-replicas, receiver-health, store-health, query-cache-health)$\n"

"\n",
"A causal model looks like - \n",
"\n",
"$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-timeseries, number-of-simaltaneous-queries,store-gw-cache-health)$"
Member:

Hmm, not sure what history configuration is here, maybe retention?
It should include the number of total blocks and the number of store gateway shards. Also, PV size would need to be included as that is where it builds up the block index cache.

Suggested change
"$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-timeseries, number-of-simaltaneous-queries,store-gw-cache-health)$"
"$\\mathbf{storegw-health} = \\mathbf{f}(history-configuration, number-of-blocks, number-of-storegw-shards, number-of-timeseries-per-storeapi-call, number-of-simultaneous-queries,store-gw-cache-health, storegw-space-on-pv)$"

"output_type": "execute_result"
}
],
"source": [
Member:

I think these graphs might need to change a bit based on comments above

@bjoydeep (Collaborator, Author):

@saswatamcode thank you for the comments. I think they are mostly clear... Will work them in.

Signed-off-by: Joydeep Banerjee <[email protected]>
"Notes:\n",
"\n",
"1. For sake of Green field Capacity planning we do assume that policies or time series is uniform across all managed clusters.\n",
"1. Policy complexity is hard to express numerically\n",

A lot of the complexity is basically the number of resources that need to be considered (due to ranges or namespaces searched) and the number of resources actually managed by the policy.

"1. Kube API Server on the Hub\n",
"1. Kube API Server on the managed cluster\n",
"\n",
"However, we must consider the health in lenses of what a user of the system sees. The user sees the effect when they try to `Create a Policy` or during/for `Policy Status Updates`.\n",

It's really the creation (or update) of a replicated policy. If there are no placement details, a created policy is basically disabled. As managed clusters are added to or removed from the placement, this replicated policy creation happens.
