Zookeeper duration #541

Open
wants to merge 16 commits into
base: master
2 changes: 2 additions & 0 deletions docs/severity.md
@@ -1311,6 +1311,8 @@

|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|Zookeeper zookeeper-health|X|-|-|-|-|
|Zookeeper zookeeper-latency|X|X|-|-|-|
|Zookeeper heartbeat|X|-|-|-|-|
|Zookeeper service health|X|-|-|-|-|
|Zookeeper latency|X|X|-|-|-|
4 changes: 3 additions & 1 deletion modules/smart-agent_zookeeper/README.md
@@ -59,7 +59,7 @@ Note the following parameters:

These 3 parameters along with all variables defined in [common-variables.tf](common-variables.tf) are common to all
[modules](../) in this repository. Other variables, specific to this module, are available in
-[variables.tf](variables.tf).
+[variables.tf](variables.tf) and [variables-gen.tf](variables-gen.tf).
In general, the default configuration "works" but all of these Terraform
[variables](https://www.terraform.io/language/values/variables) make it possible to
customize the detectors behavior to better fit your needs.
@@ -77,6 +77,8 @@ This module creates the following SignalFx detectors which could contain one or

|Detector|Critical|Major|Minor|Warning|Info|
|---|---|---|---|---|---|
|Zookeeper zookeeper-health|X|-|-|-|-|
|Zookeeper zookeeper-latency|X|X|-|-|-|
|Zookeeper heartbeat|X|-|-|-|-|
|Zookeeper service health|X|-|-|-|-|
|Zookeeper latency|X|X|-|-|-|
16 changes: 16 additions & 0 deletions modules/smart-agent_zookeeper/conf/00-zookeeper-health.yaml
@@ -0,0 +1,16 @@
module: zookeeper
Contributor

This detector should aggregate across all servers in the cluster, triggering a major alert when part of the servers are lost (half? a third?) and a critical alert when more than that are lost.

Contributor

For example:

```
signal = data('gauge.zk_service_health', filter=filter('env', 'preprod') and filter('sfx_monitored', 'true')).mean(by=['plugin_instance']).publish('signal')
detect(when(signal < 0.66, lasting='5m', at_least=1)).publish('CRIT')
detect(when(signal < 1, lasting='5m', at_least=1)).publish('MAJ')
```
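
To make the thresholds in the example above concrete, here is a minimal Python sketch of the same arithmetic outside SignalFlow. It assumes that `gauge.zk_service_health` reports 1 for a healthy server and 0 otherwise, and that the input holds the latest value for each server of one cluster. The function name `classify_cluster_health` and the server names are hypothetical, and the `lasting` window is ignored.

```python
from statistics import mean


def classify_cluster_health(health_by_server, critical_below=0.66, major_below=1.0):
    """Map per-server health values (assumed 1 = healthy, 0 = not) to a severity.

    Returns "CRIT", "MAJ" or None. The input must not be empty.
    """
    # Same idea as .mean() in the SignalFlow example: fraction of healthy servers.
    healthy_fraction = mean(health_by_server.values())
    if healthy_fraction < critical_below:
        return "CRIT"
    if healthy_fraction < major_below:
        return "MAJ"
    return None


# Hypothetical three-server cluster.
print(classify_cluster_health({"zk1": 1, "zk2": 1, "zk3": 1}))
print(classify_cluster_health({"zk1": 1, "zk2": 1, "zk3": 0}))
print(classify_cluster_health({"zk1": 1, "zk2": 0, "zk3": 0}))
```

With three servers, losing one leaves a healthy fraction of about 0.667, which is not below 0.66, so only the major level applies; losing two drops it to about 0.33 and reaches critical. Unlike the two `detect` statements, which would both fire below 0.66, this sketch returns only the highest matching severity.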

name: zookeeper-health
Contributor

Suggested change
-name: zookeeper-health
+name: health

transformation: false
aggregation: true
exclude_not_running_vm: true
Contributor

Is this necessary?

disabled: false
signals:
  signal:
    metric: "gauge.zk_service_health"
rules:
  critical:
    threshold: 1
    comparator: "!="
    description: "is not running"
    lasting_duration: "5m"
    health_disabled: "false"
22 changes: 22 additions & 0 deletions modules/smart-agent_zookeeper/conf/01-zookeeper-latency.yaml
@@ -0,0 +1,22 @@
module: zookeeper
name: zookeeper-latency
Contributor

Suggested change
-name: zookeeper-latency
+name: latency

transformation: false
aggregation: true
exclude_not_running_vm: true
Contributor

Is this necessary?

disabled: false
signals:
  signal:
    metric: "gauge.zk_avg_latency"
rules:
  critical:
Contributor

I'm not sure that high latency on a single server should trigger a critical alert.
Maybe use 2 detectors:

  • one that triggers major and critical if all servers in a cluster have high latency
  • one that triggers major for high latency on a single server
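
The split suggested above can be illustrated with a short Python sketch. This is not SignalFlow and not part of the module: the function name `classify_latency` and the server names are hypothetical, and the default thresholds are simply the values from this PR (250000 and 300000). "All servers have high latency" is modelled as the lowest per-server latency exceeding a threshold, which in SignalFlow would presumably correspond to a `.min()` aggregation over the cluster.

```python
def classify_latency(latency_by_server, major_above=250000, critical_above=300000):
    """Return (cluster_severity, per_server_severity) for one cluster.

    cluster_severity is "CRIT", "MAJ" or None and only reacts when every
    server is slow; per_server_severity maps each slow server to "MAJ".
    """
    if not latency_by_server:
        return None, {}

    # If even the fastest server is above the threshold, the whole cluster is slow.
    lowest = min(latency_by_server.values())
    if lowest > critical_above:
        cluster = "CRIT"
    elif lowest > major_above:
        cluster = "MAJ"
    else:
        cluster = None

    # A single slow server never goes beyond major.
    per_server = {
        name: "MAJ"
        for name, value in latency_by_server.items()
        if value > major_above
    }
    return cluster, per_server


# Hypothetical two-server cluster where only one server is slow.
print(classify_latency({"zk1": 310000, "zk2": 1000}))
```

With this split, one slow server raises a major alert for that server only, while the cluster-level result stays clear until every server is above a threshold.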

    threshold: 300000
    comparator: ">"
    description: "is too high"
    lasting_duration: "5m"
    latency_disabled: "false"
  major:
    threshold: 250000
Contributor

The critical and major thresholds seem too close to each other.

    comparator: ">"
    description: "is too high"
    lasting_duration: "5m"
    latency_disabled: "false"
67 changes: 67 additions & 0 deletions modules/smart-agent_zookeeper/detectors-gen.tf
@@ -0,0 +1,67 @@
resource "signalfx_detector" "zookeeper-health" {
  name = format("%s %s", local.detector_name_prefix, "Zookeeper zookeeper-health")

  authorized_writer_teams = var.authorized_writer_teams
  teams                   = try(coalescelist(var.teams, var.authorized_writer_teams), null)
  tags                    = compact(concat(local.common_tags, local.tags, var.extra_tags))

  program_text = <<-EOF
    signal = data('gauge.zk_service_health', filter=${local.not_running_vm_filters} and ${module.filtering.signalflow})${var.zookeeper-health_aggregation_function}.publish('signal')
    detect(when(signal != ${var.zookeeper-health_threshold_critical}, lasting=%{if var.zookeeper-health_lasting_duration_critical == null}None%{else}'${var.zookeeper-health_lasting_duration_critical}'%{endif}, at_least=${var.zookeeper-health_at_least_percentage_critical})).publish('CRIT')
EOF

  rule {
    description           = "is not running != ${var.zookeeper-health_threshold_critical}"
    severity              = "Critical"
    detect_label          = "CRIT"
    disabled              = coalesce(var.zookeeper-health_disabled, var.detectors_disabled)
    notifications         = try(coalescelist(lookup(var.zookeeper-health_notifications, "critical", []), var.notifications.critical), null)
    runbook_url           = try(coalesce(var.zookeeper-health_runbook_url, var.runbook_url), "")
    tip                   = var.zookeeper-health_tip
    parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
    parameterized_body    = var.message_body == "" ? local.rule_body : var.message_body
  }

  max_delay = var.zookeeper-health_max_delay
}

resource "signalfx_detector" "zookeeper-latency" {
  name = format("%s %s", local.detector_name_prefix, "Zookeeper zookeeper-latency")

  authorized_writer_teams = var.authorized_writer_teams
  teams                   = try(coalescelist(var.teams, var.authorized_writer_teams), null)
  tags                    = compact(concat(local.common_tags, local.tags, var.extra_tags))

  program_text = <<-EOF
    signal = data('gauge.zk_avg_latency', filter=${local.not_running_vm_filters} and ${module.filtering.signalflow})${var.zookeeper-latency_aggregation_function}.publish('signal')
    detect(when(signal > ${var.zookeeper-latency_threshold_critical}, lasting=%{if var.zookeeper-latency_lasting_duration_critical == null}None%{else}'${var.zookeeper-latency_lasting_duration_critical}'%{endif}, at_least=${var.zookeeper-latency_at_least_percentage_critical})).publish('CRIT')
    detect(when(signal > ${var.zookeeper-latency_threshold_major}, lasting=%{if var.zookeeper-latency_lasting_duration_major == null}None%{else}'${var.zookeeper-latency_lasting_duration_major}'%{endif}, at_least=${var.zookeeper-latency_at_least_percentage_major})).publish('MAJOR')
EOF

  rule {
    description           = "is too high > ${var.zookeeper-latency_threshold_critical}"
    severity              = "Critical"
    detect_label          = "CRIT"
    disabled              = coalesce(var.zookeeper-latency_disabled_critical, var.zookeeper-latency_disabled, var.detectors_disabled)
    notifications         = try(coalescelist(lookup(var.zookeeper-latency_notifications, "critical", []), var.notifications.critical), null)
    runbook_url           = try(coalesce(var.zookeeper-latency_runbook_url, var.runbook_url), "")
    tip                   = var.zookeeper-latency_tip
    parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
    parameterized_body    = var.message_body == "" ? local.rule_body : var.message_body
  }

  rule {
    description           = "is too high > ${var.zookeeper-latency_threshold_major}"
    severity              = "Major"
    detect_label          = "MAJOR"
    disabled              = coalesce(var.zookeeper-latency_disabled_major, var.zookeeper-latency_disabled, var.detectors_disabled)
    notifications         = try(coalescelist(lookup(var.zookeeper-latency_notifications, "major", []), var.notifications.major), null)
    runbook_url           = try(coalesce(var.zookeeper-latency_runbook_url, var.runbook_url), "")
    tip                   = var.zookeeper-latency_tip
    parameterized_subject = var.message_subject == "" ? local.rule_subject : var.message_subject
    parameterized_body    = var.message_body == "" ? local.rule_body : var.message_body
  }

  max_delay = var.zookeeper-latency_max_delay
}

8 changes: 4 additions & 4 deletions modules/smart-agent_zookeeper/detectors-zookeeper.tf
@@ -26,7 +26,7 @@ EOF
max_delay = var.heartbeat_max_delay
}

-resource "signalfx_detector" "zookeeper_health" {
+/*resource "signalfx_detector" "zookeeper_health" {
Contributor

Remove this commented-out block.

name = format("%s %s", local.detector_name_prefix, "Zookeeper service health")

authorized_writer_teams = var.authorized_writer_teams
@@ -51,9 +51,9 @@ EOF
}

max_delay = var.zookeeper_health_max_delay
-}
+}*/

-resource "signalfx_detector" "zookeeper_latency" {
+/*resource "signalfx_detector" "zookeeper_latency" {
name = format("%s %s", local.detector_name_prefix, "Zookeeper latency")

authorized_writer_teams = var.authorized_writer_teams
@@ -91,7 +91,7 @@ EOF
}

max_delay = var.zookeeper_latency_max_delay
-}
+}*/

resource "signalfx_detector" "file_descriptors" {
name = format("%s %s", local.detector_name_prefix, "Zookeeper file descriptors usage")
12 changes: 6 additions & 6 deletions modules/smart-agent_zookeeper/outputs.tf
@@ -8,13 +8,13 @@ output "heartbeat" {
value = signalfx_detector.heartbeat
}

-output "zookeeper_health" {
-  description = "Detector resource for zookeeper_health"
-  value       = signalfx_detector.zookeeper_health
+output "zookeeper-health" {
+  description = "Detector resource for zookeeper-health"
+  value       = signalfx_detector.zookeeper-health
}

-output "zookeeper_latency" {
-  description = "Detector resource for zookeeper_latency"
-  value       = signalfx_detector.zookeeper_latency
+output "zookeeper-latency" {
+  description = "Detector resource for zookeeper-latency"
+  value       = signalfx_detector.zookeeper-latency
}

139 changes: 139 additions & 0 deletions modules/smart-agent_zookeeper/variables-gen.tf
@@ -0,0 +1,139 @@
# zookeeper-health detector

variable "zookeeper-health_notifications" {
  description = "Notification recipients list per severity overridden for zookeeper-health detector"
  type        = map(list(string))
  default     = {}
}

variable "zookeeper-health_aggregation_function" {
  description = "Aggregation function and group by for zookeeper-health detector (i.e. \".mean(by=['host'])\")"
  type        = string
  default     = ""
}

variable "zookeeper-health_max_delay" {
  description = "Enforce max delay for zookeeper-health detector (use \"0\" or \"null\" for \"Auto\")"
  type        = number
  default     = null
}

variable "zookeeper-health_tip" {
  description = "Suggested first course of action or any note useful for incident handling"
  type        = string
  default     = ""
}

variable "zookeeper-health_runbook_url" {
  description = "URL like SignalFx dashboard or wiki page which can help to troubleshoot the incident cause"
  type        = string
  default     = ""
}

variable "zookeeper-health_disabled" {
  description = "Disable all alerting rules for zookeeper-health detector"
  type        = bool
  default     = null
}

variable "zookeeper-health_threshold_critical" {
  description = "Critical threshold for zookeeper-health detector"
  type        = number
  default     = 1
}

variable "zookeeper-health_lasting_duration_critical" {
  description = "Minimum duration that conditions must be true before raising alert"
  type        = string
  default     = "5m"
}

variable "zookeeper-health_at_least_percentage_critical" {
  description = "Percentage of lasting that conditions must be true before raising alert (>= 0.0 and <= 1.0)"
  type        = number
  default     = 1
}
# zookeeper-latency detector

variable "zookeeper-latency_notifications" {
  description = "Notification recipients list per severity overridden for zookeeper-latency detector"
  type        = map(list(string))
  default     = {}
}

variable "zookeeper-latency_aggregation_function" {
  description = "Aggregation function and group by for zookeeper-latency detector (i.e. \".mean(by=['host'])\")"
  type        = string
  default     = ""
}

variable "zookeeper-latency_max_delay" {
  description = "Enforce max delay for zookeeper-latency detector (use \"0\" or \"null\" for \"Auto\")"
  type        = number
  default     = null
}

variable "zookeeper-latency_tip" {
  description = "Suggested first course of action or any note useful for incident handling"
  type        = string
  default     = ""
}

variable "zookeeper-latency_runbook_url" {
  description = "URL like SignalFx dashboard or wiki page which can help to troubleshoot the incident cause"
  type        = string
  default     = ""
}

variable "zookeeper-latency_disabled" {
  description = "Disable all alerting rules for zookeeper-latency detector"
  type        = bool
  default     = null
}

variable "zookeeper-latency_disabled_critical" {
  description = "Disable critical alerting rule for zookeeper-latency detector"
  type        = bool
  default     = null
}

variable "zookeeper-latency_disabled_major" {
  description = "Disable major alerting rule for zookeeper-latency detector"
  type        = bool
  default     = null
}

variable "zookeeper-latency_threshold_critical" {
  description = "Critical threshold for zookeeper-latency detector"
  type        = number
  default     = 300000
}

variable "zookeeper-latency_lasting_duration_critical" {
  description = "Minimum duration that conditions must be true before raising alert"
  type        = string
  default     = "5m"
}

variable "zookeeper-latency_at_least_percentage_critical" {
  description = "Percentage of lasting that conditions must be true before raising alert (>= 0.0 and <= 1.0)"
  type        = number
  default     = 1
}
variable "zookeeper-latency_threshold_major" {
  description = "Major threshold for zookeeper-latency detector"
  type        = number
  default     = 250000
}

variable "zookeeper-latency_lasting_duration_major" {
  description = "Minimum duration that conditions must be true before raising alert"
  type        = string
  default     = "5m"
}

variable "zookeeper-latency_at_least_percentage_major" {
  description = "Percentage of lasting that conditions must be true before raising alert (>= 0.0 and <= 1.0)"
  type        = number
  default     = 1
}
4 changes: 2 additions & 2 deletions modules/smart-agent_zookeeper/variables.tf
@@ -44,7 +44,7 @@ variable "heartbeat_aggregation_function" {
default = ""
}

-# zookeeper_health detector
+/*# zookeeper_health detector
Contributor

Remove this commented-out block.


variable "zookeeper_health_max_delay" {
description = "Enforce max delay for zookeeper_health detector (use \"0\" or \"null\" for \"Auto\")"
@@ -160,7 +160,7 @@ variable "zookeeper_latency_threshold_major" {
description = "Major threshold for zookeeper_latency detector"
type = number
default = 250000
-}
+}*/

# file_descriptors detector
