31 changes: 31 additions & 0 deletions src/part-3/chapter-10/alerting.md
@@ -0,0 +1,31 @@
## **Alerting**

When an alerting rule is evaluated by a Borgmon, the result is either true, in which case the alert is triggered, or false. Experience shows that alerts can “flap” (toggle their state quickly); therefore, the rules allow a minimum duration for which the alerting rule must be true before the alert is sent. Typically, this duration is set to at least two rule evaluation cycles to ensure no missed collections cause a false alert.

The following example creates an alert when the error ratio over 10 minutes exceeds 1% and the error rate over the same window exceeds 1:

```perl
rules <<<
{var=dc:http_errors:ratio_rate10m,job=webserver} > 0.01
and by job, error
{var=dc:http_errors:rate10m,job=webserver} > 1
for 2m
=> ErrorRatioTooHigh
details "webserver error ratio at [[trigger_value]]"
labels {severity=page};
>>>
```

Our example holds the ratio rate at 0.15, which is well over the threshold of 0.01 in the alerting rule. However, the error rate is not greater than 1 at this moment, so the alert won’t be active. Once the error rate exceeds 1, the alert will go pending for two minutes to ensure it isn’t a transient state, and only then will it fire.
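
The pending-then-firing behavior can be illustrated with a small Python sketch. This is not Borgmon code: `AlertState`, `rule_is_true`, and the sample values are hypothetical, and the sketch assumes that any false evaluation resets the alert to inactive.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert through inactive -> pending -> firing (hypothetical model)."""
    hold_seconds: float                    # minimum time the rule must stay true
    pending_since: Optional[float] = None  # when the rule first evaluated true
    state: str = "inactive"

    def evaluate(self, condition: bool, now: float) -> str:
        if not condition:
            # Assumption: a false evaluation resets the alert, which damps flapping.
            self.pending_since = None
            self.state = "inactive"
        elif self.pending_since is None:
            self.pending_since = now
            self.state = "pending"
        elif now - self.pending_since >= self.hold_seconds:
            self.state = "firing"
        return self.state


def rule_is_true(ratio: float, error_rate: float) -> bool:
    # Mirrors the two thresholds in the rule above.
    return ratio > 0.01 and error_rate > 1


alert = AlertState(hold_seconds=120)  # corresponds to "for 2m"
# (timestamp in seconds, error ratio, error rate); made-up sample values
samples = [(0, 0.15, 0.5), (60, 0.15, 2), (120, 0.15, 2), (180, 0.15, 2)]
states = [alert.evaluate(rule_is_true(r, e), t) for t, r, e in samples]
print(states)
```

With these samples the ratio is always above its threshold, but the alert only becomes pending once the second condition holds, and it fires only after the condition has held for the full two minutes.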

The alert rule contains a small template for filling out a message containing contextual information: which job the alert is for, the name of the alert, the numerical value of the triggering rule, and so on. The contextual information is filled out by Borgmon when the alert fires and is sent in the Alert RPC.

Borgmon is connected to a centrally run service, known as the Alertmanager, which receives Alert RPCs when the rule first triggers, and then again when the alert is considered to be “firing.” The Alertmanager is responsible for routing the alert notification to the correct destination. Alertmanager can be configured to do the following:

* Inhibit certain alerts when others are active
* Deduplicate alerts from multiple Borgmon that have the same labelsets
* Fan-in or fan-out alerts based on their labelsets when multiple alerts with similar labelsets fire
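
As a rough illustration of the deduplication step, the Python sketch below collapses alerts that share a name and labelset, regardless of which monitoring instance sent them. The `Alert` type, its `source` field, and the `deduplicate` function are hypothetical and do not reflect the Alertmanager's actual data model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

LabelSet = FrozenSet[Tuple[str, str]]


@dataclass(frozen=True)
class Alert:
    name: str
    labels: LabelSet
    source: str  # hypothetical: which monitoring instance sent the alert


def make_alert(name: str, labels: Dict[str, str], source: str) -> Alert:
    # A frozenset makes the labelset hashable and independent of label order.
    return Alert(name, frozenset(labels.items()), source)


def deduplicate(alerts: List[Alert]) -> List[Alert]:
    """Keep the first alert seen for each (name, labelset) pair."""
    seen: Dict[Tuple[str, LabelSet], Alert] = {}
    for alert in alerts:
        seen.setdefault((alert.name, alert.labels), alert)
    return list(seen.values())


incoming = [
    make_alert("ErrorRatioTooHigh", {"job": "webserver", "severity": "page"}, "borgmon-a"),
    make_alert("ErrorRatioTooHigh", {"severity": "page", "job": "webserver"}, "borgmon-b"),
    make_alert("ErrorRatioTooHigh", {"job": "backend", "severity": "page"}, "borgmon-a"),
]
unique = deduplicate(incoming)
print([(a.name, dict(a.labels)["job"]) for a in unique])
```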

As described in Chapter 6, teams send their page-worthy alerts to their on-call rotation and their important but subcritical alerts to their ticket queues. All other alerts should be retained as informational data for status dashboards.

A more comprehensive guide to alert design can be found in Chapter 4.
9 changes: 9 additions & 0 deletions src/part-3/chapter-10/black-box_monitoring.md
@@ -0,0 +1,9 @@
## **Black-Box Monitoring**

Borgmon is a white-box monitoring system—it inspects the internal state of the target service, and the rules are written with knowledge of the internals in mind. The transparent nature of this model provides great power to identify quickly what components are failing, which queues are full, and where bottlenecks occur, both when responding to an incident and when testing a new feature deployment.

However, white-box monitoring does not provide a full picture of the system being monitored; relying solely upon white-box monitoring means that you aren’t aware of what the users see. You only see the queries that arrive at the target; the queries that never make it due to a DNS error are invisible, while queries lost due to a server crash never make a sound. You can only alert on the failures that you expected.

Teams at Google solve this coverage issue with Prober, which runs a protocol check against a target and reports success or failure. Prober can send alerts directly to Alertmanager, or its own varz can be collected by a Borgmon. Prober can inspect the response payload of the protocol (e.g., the HTML contents of an HTTP response), validate that the contents are as expected, and even extract and export values as time-series. Teams often use Prober to export histograms of response times by operation type and payload size so that they can slice and dice the user-visible performance. Prober is a hybrid of the check-and-test model with some richer variable extraction to create time-series.

Prober can be pointed either at the frontend domain or at targets behind the load balancer. By using both targets, we can detect localized failures and suppress alerts. For example, we might monitor both the load-balanced www.google.com and the web servers in each datacenter behind the load balancer. This setup allows us to either know that traffic is still served when a datacenter fails, or to quickly isolate an edge in the traffic flow graph where a failure has occurred.
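
A minimal Python sketch of such a black-box check follows. It is not Prober's interface: `probe`, `probe_success`, and `probe_latency_seconds` are hypothetical names, and the `fetch` callable stands in for a real protocol request so that the example runs without network access.

```python
import time
from typing import Callable, Dict


def probe(fetch: Callable[[], str], expected: str) -> Dict[str, float]:
    """Run one check and return values that could be exported as time-series."""
    start = time.monotonic()
    try:
        body = fetch()
        success = expected in body  # validate the response payload
    except Exception:
        # Any failure to get a response counts as a failed probe.
        success = False
    elapsed = time.monotonic() - start
    return {
        "probe_success": 1.0 if success else 0.0,
        "probe_latency_seconds": elapsed,
    }


def healthy_frontend() -> str:
    return "<html><title>Search</title></html>"


def broken_frontend() -> str:
    raise ConnectionError("name resolution failed")


ok = probe(healthy_frontend, expected="<title>Search</title>")
failed = probe(broken_frontend, expected="<title>Search</title>")
print(ok["probe_success"], failed["probe_success"])
```

The failed case is the kind of error that white-box monitoring cannot see, because the request never reaches the target.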
33 changes: 33 additions & 0 deletions src/part-3/chapter-10/collection_of_exported_data.md
@@ -0,0 +1,33 @@
## **Collection of Exported Data**

A Borgmon instance finds its targets through a target list, which is configured using one of many name resolution methods. The target list is often dynamic, so using service discovery reduces the cost of maintaining it and allows the monitoring to scale.

At predefined intervals, Borgmon fetches the `/varz` URI on each target, decodes the results, and stores the values in memory. Borgmon also spreads the collection from each instance in the target list over the whole interval, so that collection from each target is not in lockstep with its peers.
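
One way to spread collections over an interval is to derive a stable offset for each target from a hash of its name. The Python sketch below shows that idea; the source does not say how Borgmon computes its offsets, so the hashing scheme, the `scrape_offset` function, and the target names are all assumptions.

```python
import hashlib


def scrape_offset(target: str, interval_seconds: float) -> float:
    """Return a stable offset in [0, interval_seconds) for a target."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    # Six bytes give a 48-bit integer, so the fraction is exact and below 1.
    fraction = int.from_bytes(digest[:6], "big") / 2**48
    return fraction * interval_seconds


targets = ["host0:80", "host1:80", "host2:80", "host3:80"]  # hypothetical targets
offsets = {t: scrape_offset(t, interval_seconds=10.0) for t in targets}
for target, offset in sorted(offsets.items(), key=lambda item: item[1]):
    print(f"{target} collected {offset:.2f}s into each interval")
```

Because the offset depends only on the target name, each target is collected at the same point in every interval, while different targets land at different points.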

Borgmon also records "synthetic" variables for each target in order to identify:

* Whether the name was resolved to a host and port
* Whether the target responded to a collection
* Whether the target responded to a health check
* What time the collection finished

These synthetic variables make it easy to write rules to detect if the monitored tasks are unavailable.
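
The Python sketch below shows how per-target synthetic values could feed an availability check. The `ScrapeResult` fields, the variable names, and the `unavailable_targets` helper are hypothetical; they only mirror the four items listed above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScrapeResult:
    resolved: bool      # name resolved to a host and port
    responded: bool     # target answered the collection
    healthy: bool       # target answered the health check
    finished_at: float  # time the collection finished (seconds)


def synthetic_variables(result: ScrapeResult) -> Dict[str, float]:
    """Flatten one result into numeric values suitable for time-series."""
    return {
        "resolved": float(result.resolved),
        "responded": float(result.responded),
        "healthy": float(result.healthy),
        "collection_finished": result.finished_at,
    }


def unavailable_targets(results: Dict[str, ScrapeResult]) -> List[str]:
    """A rule-like check: targets that failed any of the three conditions."""
    return sorted(
        target
        for target, r in results.items()
        if not (r.resolved and r.responded and r.healthy)
    )


results = {
    "host0:80": ScrapeResult(True, True, True, 100.0),
    "host1:80": ScrapeResult(True, False, False, 100.4),
    "host2:80": ScrapeResult(False, False, False, 100.1),
}
print(unavailable_targets(results))
```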

It’s interesting that `varz` is quite dissimilar to SNMP (Simple Network Management Protocol), which "is designed [...] to have minimal transport requirements and to continue working when most other network applications fail" [[Mic03]](https://technet.microsoft.com/en-us/library/cc776379%28v=ws.10%29.aspx). Scraping targets over HTTP seems to be at odds with this design principle; however, experience shows that this is rarely an issue. The system itself is already designed to be robust against network and machine failures, and Borgmon allows engineers to write smarter alerting rules by using the collection failure itself as a signal.
Binary file added src/part-3/chapter-10/figures/10-1.png
Binary file added src/part-3/chapter-10/figures/10-2.png
Binary file added src/part-3/chapter-10/figures/10-3.png
17 changes: 17 additions & 0 deletions src/part-3/chapter-10/instrumentation_of_applications.md
@@ -0,0 +1,17 @@
## **Instrumentation of Applications**

The `/varz` HTTP handler simply lists all the exported variables in plain text, as space-separated keys and values, one per line. A later extension added a mapped variable, which allows the exporter to define several labels on a variable name, and then export a table of values or a histogram. An example map-valued variable looks like the following, showing 25 HTTP 200 responses and 12 HTTP 500s:

`http_responses map:code 200:25 404:0 500:12`
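
To make the format concrete, here is a minimal Python sketch that parses the line above. The grammar is inferred from this single example, so `parse_map_variable` is a hypothetical helper rather than a description of the real `/varz` decoder.

```python
from typing import Dict, Tuple


def parse_map_variable(line: str) -> Tuple[str, str, Dict[str, int]]:
    """Split 'name map:label key:value ...' into (name, label, values)."""
    name, kind, *pairs = line.split()
    prefix, _, label = kind.partition(":")
    if prefix != "map" or not label:
        raise ValueError(f"not a map-valued variable: {line!r}")
    values: Dict[str, int] = {}
    for pair in pairs:
        # Split on the last colon so the key itself may contain colons.
        key, _, raw = pair.rpartition(":")
        values[key] = int(raw)
    return name, label, values


name, label, values = parse_map_variable(
    "http_responses map:code 200:25 404:0 500:12"
)
print(name, label, values)
```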

Adding a metric to a program only requires a single declaration in the code where the metric is needed.

In hindsight, it’s apparent that this schemaless textual interface makes the barrier to adding new instrumentation very low, which is a positive for both the software engineering and SRE teams. However, this has a trade-off against ongoing maintenance; the decoupling of the variable definition from its use in Borgmon rules requires careful change management. In practice, this trade-off has been satisfactory because tools to validate and generate rules have been written as well.
23 changes: 23 additions & 0 deletions src/part-3/chapter-10/maintaining_the_configuration.md
@@ -0,0 +1,23 @@
## **Maintaining the Configuration**

Borgmon configuration separates the definition of the rules from the targets being monitored. This means the same sets of rules can be applied to many targets at once, instead of writing nearly identical configuration over and over. This separation of concerns might seem incidental, but it greatly reduces the cost of maintaining the monitoring by avoiding lots of repetition in describing the target systems.

Borgmon also supports language templates. This macro-like system enables engineers to construct libraries of rules that can be reused. This functionality again reduces repetition, thus reducing the likelihood of bugs in the configuration.

Of course, any high-level programming environment creates the opportunity for complexity, so Borgmon provides a way to build extensive unit and regression tests by synthesizing time-series data, in order to ensure that the rules behave as the author thinks they do. The Production Monitoring team runs a continuous integration service that executes a suite of these tests, packages the configuration, and ships the configuration to all the Borgmon in production, which then validate the configuration before accepting it.

In the vast library of common templates that have been created, two classes of monitoring configuration have emerged. The first class simply codifies the emergent schema of variables exported from a given library of code, such that any user of the library can reuse the template of its varz. Such templates exist for the HTTP server library, memory allocation, the storage client library, and generic RPC services, among others. (While the varz interface declares no schema, the rule library associated with the code library ends up declaring a schema.)

The second class of library emerged as we built templates to manage the aggregation of data from a single-server task to the global service footprint. These libraries contain generic aggregation rules for exported variables that engineers can use to model the topology of their service.

For example, a service may provide a single global API, but be homed in many datacenters. Within each datacenter, the service is composed of several shards, and each shard is composed of several jobs with various numbers of tasks. An engineer can model this breakdown with Borgmon rules so that when debugging, subcomponents can be isolated from the rest of the system. These groupings typically follow the shared fate of components; e.g., individual tasks share fate due to configuration files, jobs in a shard share fate because they’re homed in the same datacenter, and physical sites share fate due to networking.

Labeling conventions make such division possible: a Borgmon adds labels indicating the target’s instance name and the shard and datacenter it occupies, which can be used to group and aggregate those time-series together.

Thus, we have multiple uses for labels on a time-series, though all are interchangeable:

* Labels that define breakdowns of the data itself (e.g., our HTTP response code on the http_responses variable)
* Labels that define the source of the data (e.g., the instance or job name)
* Labels that indicate the locality or aggregation of the data within the service as a whole (e.g., the zone label describing a physical location, a shard label describing a logical grouping of tasks)
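
The grouping that these labels enable can be sketched in Python as a sum over all time-series that agree on a chosen subset of labels. The `aggregate` function, the label names, and the sample values are hypothetical and only illustrate the idea, not Borgmon's rule language.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

Labels = FrozenSet[Tuple[str, str]]


def aggregate(series: Dict[Labels, float], keep: Iterable[str]) -> Dict[Labels, float]:
    """Sum values across series, grouping by the labels named in `keep`."""
    wanted = set(keep)
    totals: Dict[Labels, float] = defaultdict(float)
    for labels, value in series.items():
        # Drop every label not in `keep`, e.g. the per-task instance label.
        key = frozenset((k, v) for k, v in labels if k in wanted)
        totals[key] += value
    return dict(totals)


def labels(**kwargs: str) -> Labels:
    return frozenset(kwargs.items())


# Made-up per-task request rates.
series = {
    labels(job="webserver", instance="host0", zone="a"): 10.0,
    labels(job="webserver", instance="host1", zone="a"): 20.0,
    labels(job="webserver", instance="host2", zone="b"): 5.0,
}
by_zone = aggregate(series, keep=["job", "zone"])
by_job = aggregate(series, keep=["job"])
print(by_zone[labels(job="webserver", zone="a")], by_job[labels(job="webserver")])
```

Calling the same function with a different `keep` list aggregates to a different tier, which is the kind of reuse the templates are meant to provide.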

The templated nature of these libraries allows flexibility in their use. The same template can be used to aggregate from each tier.
52 changes: 52 additions & 0 deletions src/part-3/chapter-10/practical_alerting_from_time-series_data.md
@@ -0,0 +1,52 @@
# **Practical Alerting from Time-Series Data**

> Written by Jamie Wilkinson
>
> Edited by Kavita Guliani

> May the queries flow, and the pager stay silent.
>
> —Traditional SRE blessing

Monitoring, the bottom layer of the Hierarchy of Production Needs, is fundamental to running a stable service. Monitoring enables service owners to make rational decisions about the impact of changes to the service, apply the scientific method to incident response, and of course ensure their reason for existence: to measure the service’s alignment with business goals (see [Chapter 6](../../part-2/chapter-06/monitoring_distributed_systems.md)).

Regardless of whether a service enjoys SRE support, it should be run in a symbiotic relationship with its monitoring. But having been tasked with ultimate responsibility for Google Production, SREs develop a particularly intimate knowledge of the monitoring infrastructure that supports their service.

Monitoring a very large system is challenging for a couple of reasons:

* The sheer number of components being analyzed
* The need to maintain a reasonably low maintenance burden on the engineers responsible for the system

Google’s monitoring systems don’t just measure simple metrics, such as the average response time of an unladen European web server; we also need to understand the distribution of those response times across all web servers in that region. This knowledge enables us to identify the factors contributing to the latency tail.

At the scale at which our systems operate, being alerted for single-machine failures is unacceptable because such data is too noisy to be actionable. Instead, we try to build systems that are robust against failures in the systems they depend on. Rather than requiring management of many individual components, a large system should be designed to aggregate signals and prune outliers. We need monitoring systems that allow us to alert for high-level service objectives, but retain the granularity to inspect individual components as needed.

Google’s monitoring systems evolved over the course of 10 years from the traditional model of custom scripts that check responses and alert, wholly separated from visual display of trends, to a new paradigm. This new model made the collection of time-series a first-class role of the monitoring system, and replaced those check scripts with a rich language for manipulating time-series into charts and alerts.

* [The Rise of Borgmon](the_rise_of_borgmon.md)
* [Instrumentation of Applications](instrumentation_of_applications.md)
* [Collection of Exported Data](collection_of_exported_data.md)
* [Storage in the Time-Series Arena](storage_in_the_time-series_arena.md)
* [Rule Evaluation](rule_evaluation.md)
* [Alerting](alerting.md)
* [Sharding the Monitoring Topology](sharding_the_monitoring_topology.md)
* [Black-Box Monitoring](black-box_monitoring.md)
* [Maintaining the Configuration](maintaining_the_configuration.md)
* [Ten Years On](ten_years_on.md)