| 1 | +# Conventions |
| 2 | + |
| 3 | +Always favour convention over configuration, and give any configuration |
| 4 | +sensible defaults. |
| 5 | + |
| 6 | +## Naming Conventions |
| 7 | + |
| 8 | +### Resources |
| 9 | + |
| 10 | +The key alert attribute is named `resource` specifically so that it |
| 11 | +is not host-centric. A resource *can* be a hostname, but it |
| 12 | +might also be an EC2 instance ID, a Docker container ID or some other |
| 13 | +type of non-host unique identifier. |
| 14 | + |
| 15 | +### Environments & Services |
| 16 | + |
| 17 | +The environment attribute is used to [namespace](https://en.wikipedia.org/wiki/Namespace) |
| 18 | +the alert resource. This allows you to have two resources with the same |
| 19 | +name (e.g. `web01`) that are differentiated by their environments |
| 20 | +(e.g. `Production` and `Development`). |
| 21 | + |
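One way to picture this namespacing is a lookup keyed on the pair of environment and resource. The following is a minimal sketch in plain Python, not Alerta's own storage model; the `record` helper and the `latest_event` dictionary are hypothetical names introduced for illustration.

```python
# Hypothetical in-memory index, not part of Alerta itself.
latest_event = {}

def record(environment, resource, event):
    """Store the most recent event for a resource within its environment."""
    latest_event[(environment, resource)] = event

record("Production", "web01", "DiskUtilHigh")
record("Development", "web01", "cpuAlarmHigh")

# The resource name is the same, but the environments differ,
# so the two alerts do not overwrite each other.
assert len(latest_event) == 2
```

Because the environment is part of the key, inconsistent spellings of one environment would silently create separate namespaces.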
| 22 | +Choose a set of environments and enforce them, e.g. `PROD`, `DEV` |
| 23 | +or `Production`, `Development` but not both. The same applies to services: |
| 24 | +`MobileAPI`, `Mobile-API` and `mobile api` are all valid |
| 25 | +but needlessly different, which makes it impossible to query consistently |
| 26 | +or to generate aggregate metrics. |
| 27 | + |
| 28 | +Note that the **_service attribute is a list_** because it is common |
| 29 | +for infrastructure (i.e. a resource) to be used by more than one service. |
| 30 | +That is, a single component failure could cause an |
| 31 | +outage in multiple services. |
| 32 | + |
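To illustrate why a list is convenient here, the sketch below collects every service touched by a set of alerts. The alerts are written as plain dictionaries with made-up values (`db01`, `Website`), and `affected_services` is a hypothetical helper, not an Alerta function.

```python
# Hypothetical alerts, written as plain dicts for illustration only.
alerts = [
    {"resource": "db01", "environment": "Production",
     "event": "DiskUtilHigh", "service": ["MobileAPI", "Website"]},
    {"resource": "web01", "environment": "Production",
     "event": "cpuAlarmHigh", "service": ["Website"]},
]

def affected_services(alerts):
    """Return the sorted set of services named by any of the alerts."""
    return sorted({name for alert in alerts for name in alert["service"]})
```

With the sample data, `affected_services(alerts)` returns `['MobileAPI', 'Website']`: the single `db01` failure is counted against both services it supports.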
| 33 | +### Event Names |
| 34 | + |
| 35 | +It can be useful to define a convention when it comes to naming |
| 36 | +events. Possible options are: |
| 37 | + |
| 38 | +* Camel case - `DiskUtilHigh` |
| 39 | +* Hierarchy - `NW:INTERFACE:DOWN` |
| 40 | +* SNMP - `cpuAlarmHigh` |
| 41 | + |
| 42 | +Querying for all disk utilisation alerts using the `alerta` CLI |
| 43 | +is then relatively straightforward: |
| 44 | + |
| 45 | +    $ alerta query --filter event=~DiskUtil |
| 46 | + |
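A naming convention is easier to keep if it can be checked mechanically. The sketch below assumes one possible reading of two of the conventions listed above (upper camel case and colon-separated hierarchy); the regular expressions and the `follows_convention` helper are illustrative choices, not rules defined by Alerta.

```python
import re

# Assumed patterns for two of the conventions; adjust them to the agreed rules.
CAMEL_CASE = re.compile(r"^(?:[A-Z][a-z0-9]+)+$")       # e.g. DiskUtilHigh
HIERARCHY = re.compile(r"^[A-Z0-9]+(?::[A-Z0-9]+)+$")   # e.g. NW:INTERFACE:DOWN

def follows_convention(event, pattern=CAMEL_CASE):
    """Return True if the event name matches the chosen naming pattern."""
    return pattern.match(event) is not None
```

Under these assumed patterns, `follows_convention("DiskUtilHigh")` is true, while `follows_convention("cpuAlarmHigh")` is false because the SNMP-style name starts with a lowercase letter.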
| 47 | +### Event Groups |
| 48 | + |
| 49 | +Also make use of the event group attribute, |
| 50 | +which gives you the ability to group related alerts. |
| 51 | + |
| 52 | +Some suggested event groups with possible events are listed below. |
| 53 | + |
| 54 | +| Event Groups | Events (examples) | |
| 55 | +|--------------------|--------------------------------------------| |
| 56 | +| `Service` | failures with entire services | |
| 57 | +| `Application` | errors from application logs | |
| 58 | +| `OS` | disk space, time sync failing | |
| 59 | +| `Performance` | system load, swap utilisation high | |
| 60 | +| `Configuration` | config mgmt tool alerts eg. Puppet or Chef | |
| 61 | +| `Web` | web server errors | |
| 62 | +| `Syslog` | unix system log messages | |
| 63 | +| `Hardware` | hardware errors | |
| 64 | +| `Storage` | NFS, SAN, NAS storage infrastructure | |
| 65 | +| `Database` | database errors, table space utilisation | |
| 66 | +| `Security` | security/authorization messages | |
| 67 | +| `Network` | network devices and infrastructure | |
| 68 | +| `Cloud` | cloud-based services or infrastructure | |
| 69 | + |
| 70 | +Querying for all performance-related alerts using the `alerta` CLI |
| 71 | +could then become: |
| 72 | + |
| 73 | +    $ alerta query --filter group=Performance |
| 74 | + |
| 75 | +### Severity Levels |
| 76 | + |
| 77 | +Agree on a subset of [severity levels](api/alert.rst#alert-severities) and |
| 78 | +be consistent with what they mean. For example, if severity levels are used |
| 79 | +consistently then integrating with a paging or email system becomes easier. |
| 80 | + |
| 81 | +| Severity     | Service Level                        | Notification                   | |
| 82 | +|--------------|--------------------------------------|--------------------------------| |
| 83 | +| `critical`   | service unavailable                  | immediate page out             | |
| 84 | +| `major`      | service impaired but still available | page during business hours     | |
| 85 | +| `minor`      | component failure                    | email only                     | |
| 86 | +| `warning`    | everything else                      | consolidate into daily email   | |
| 87 | + |
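Once severities are used consistently, the routing in the table above reduces to a simple lookup. This is a minimal sketch of that idea; the action names (`page-immediately` and so on) and the `notification_for` helper are hypothetical labels, not part of any paging or email integration.

```python
# Notification policy from the table above; the action names are hypothetical.
NOTIFICATION = {
    "critical": "page-immediately",
    "major": "page-business-hours",
    "minor": "email",
    "warning": "daily-digest",
}

def notification_for(severity):
    """Return the agreed action, or None for severities outside the subset."""
    return NOTIFICATION.get(severity)
```

Returning `None` for anything outside the agreed subset makes it obvious when a sender uses a severity that has no defined meaning.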
| 88 | +## Enforcing Conventions |
| 89 | + |
| 90 | +Once a set of naming conventions is agreed, it can be enforced by |
| 91 | +using a simple "pre-receive" plugin, similar to a [`git` hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks). |
| 92 | + |
| 93 | +A full working example called [reject][reject] can be found in the plugins |
| 94 | +directory of the project code repository and is installed by default. |
| 95 | +The server configuration settings {envvar}`ORIGIN_BLACKLIST` and |
| 96 | +{envvar}`ALLOWED_ENVIRONMENTS` can be used to tailor it for your |
| 97 | +circumstances or it can be disabled completely. |
| 98 | + |
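The kind of check such a plugin performs can be sketched in plain Python. The snippet below deliberately does not import Alerta: `Alert`, `RejectError`, the module-level `ALLOWED_ENVIRONMENTS` list with its example values, and the `pre_receive` function are all stand-ins written for illustration, and the rule that an alert must name a service is an assumed convention. The bundled reject plugin remains the reference for the real plugin interface.

```python
from dataclasses import dataclass, field

# Assumed values for illustration; a real server would read these from config.
ALLOWED_ENVIRONMENTS = ["Production", "Development"]

class RejectError(Exception):
    """Stand-in for the exception a server would use to refuse an alert."""

@dataclass
class Alert:
    """Stand-in alert carrying only the attributes discussed in this page."""
    resource: str
    event: str
    environment: str
    service: list = field(default_factory=list)

def pre_receive(alert):
    """Return the alert unchanged if it follows the conventions, else raise."""
    if alert.environment not in ALLOWED_ENVIRONMENTS:
        raise RejectError(f"environment must be one of {ALLOWED_ENVIRONMENTS}")
    if not alert.service:
        raise RejectError("alert must define at least one service")
    return alert
```

An alert with environment `Production` and a non-empty service list passes through untouched, while one sent with `PROD` raises `RejectError` before it could be stored.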
| 99 | +[reject]: https://github.com/alerta/alerta/blob/master/alerta/plugins/reject.py |