Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zookeeper duration #541

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

Zookeeper duration #541

wants to merge 16 commits into from

Conversation

souBel
Copy link

@souBel souBel commented Jan 9, 2024

Add zokeeper-health and zookeeper-latency parameters

@@ -0,0 +1,16 @@
module: zookeeper
name: zookeeper-health
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: zookeeper-health
name: health

@@ -0,0 +1,22 @@
module: zookeeper
name: zookeeper-latency
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
name: zookeeper-latency
name: latency

lasting_duration: "5m"
latency_disabled: "false"
major:
threshold: 250000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Critical & major thresholds seems ta have too close values.

name: zookeeper-health
transformation: false
aggregation: true
exclude_not_running_vm: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary ?

name: zookeeper-latency
transformation: false
aggregation: true
exclude_not_running_vm: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary ?

@@ -26,7 +26,7 @@ EOF
max_delay = var.heartbeat_max_delay
}

resource "signalfx_detector" "zookeeper_health" {
/*resource "signalfx_detector" "zookeeper_health" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to remove

@@ -44,7 +44,7 @@ variable "heartbeat_aggregation_function" {
default = ""
}

# zookeeper_health detector
/*# zookeeper_health detector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to remove

@@ -0,0 +1,16 @@
module: zookeeper
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This detector should aggregate on all servers in the cluster and trigger a major on loss of part of the servers (half ? third ?) and critical on loss of more than that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example :

signal = data('gauge.zk_service_health', filter=filter('env', 'preprod') and filter('sfx_monitored', 'true')).mean(by=['plugin_instance']).publish('signal')
detect(when(signal < 0.66, lasting='5m', at_least=1)).publish('CRIT')
detect(when(signal < 1, lasting='5m', at_least=1)).publish('MAJ')```

signal:
metric: "gauge.zk_avg_latency"
rules:
critical:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that a high latency on one server should trigger a critical alter.
Maybe 2 detectors:

  • one that trigger major and critical if all servers in a cluster have high latency
  • one that trigger major for a single server high latency

@freezeeedos
Copy link
Contributor

Work in progress, still needs some cleanup, but the detector was split as recommended, and behavior is has expected
There's one "major" detector which looks at individual nodes, and a "critical" one which looks at whole clusters

@souBel
Copy link
Author

souBel commented Jan 12, 2024

Cleanup's details done. Please check our last changes and tell us if all is right now

@souBel souBel requested a review from BzSpi January 12, 2024 09:48
@souBel
Copy link
Author

souBel commented Jan 15, 2024

We also split server-health detector : one critical for cluster and one major for single server
Please checkout

@souBel souBel requested a review from Gati0 January 15, 2024 11:06
@souBel
Copy link
Author

souBel commented Feb 20, 2024

Hello,
Were you able to checkout if everything is OK with new commits (the last) ?

@@ -0,0 +1,14 @@
module: zookeeper
name: server-health
disabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, this is the default value

@@ -0,0 +1,14 @@
module: zookeeper
name: cluster-health
disabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, this is the default value

module: zookeeper
name: server-latency
aggregation: false
disabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, this is the default value

module: zookeeper
name: cluster-latency
aggregation: ".mean(by=['kubernetes_cluster'])"
disabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, this is the default value

comparator: ">"
description: "Zookeeper cluster latency is too high"
lasting_duration: "5m"
latency_disabled: "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latency_disabled variable does not exist, I think this line should be deleted

comparator: ">"
description: "Zookeeper server latency is too high"
lasting_duration: "5m"
latency_disabled: "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

latency_disabled variable does not exist, I think this line should be deleted

comparator: "=="
description: "Zookeeper cluster is not running"
lasting_duration: "5m"
health_disabled: "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health_disabled variable does not exist, I think this line should be deleted

comparator: "!="
description: "Zookeeper server is not running"
lasting_duration: "5m"
health_disabled: "false"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health_disabled variable does not exist, I think this line should be deleted

@souBel
Copy link
Author

souBel commented Mar 13, 2024

Hello,
I updated your recommendations as requested.
Could you check if everything is OK now ?

@souBel souBel requested a review from haedri March 13, 2024 11:18
@souBel
Copy link
Author

souBel commented Mar 28, 2024

Hello,
Were you able to check it out ?

@souBel
Copy link
Author

souBel commented May 6, 2024

Any update please ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants