
In previous chapters, we have recounted how a testing culture was established at Google and how small unit tests became a fundamental part of the developer workflow. But what about other kinds of tests? It turns out that Google does indeed use many larger tests, and these comprise a significant part of the risk mitigation strategy necessary for healthy software engineering. But these tests present additional challenges to ensure that they are valuable assets and not resource sinks. In this chapter, we’ll discuss what we mean by “larger tests,” when we execute them, and best practices for keeping them effective.

在前几章中,我们已经讲述了测试文化是如何在Google建立的,以及小型单元测试是如何成为开发人员工作流程的基本组成部分的。那么其他类型的测试呢?事实证明,Google确实使用了许多大型测试,这些测试构成了健康的软件工程所需的风险缓解策略的重要组成部分。但是,要确保这些测试是有价值的资产而不是资源黑洞,还需要应对额外的挑战。在这一章中,我们将讨论什么是"大型测试",什么时候执行这些测试,以及保持其有效性的最佳实践。

## What Are Larger Tests? 什么是大型测试?

As mentioned previously, Google has specific notions of test size. Small tests are restricted to one thread, one process, one machine. Larger tests do not have the same restrictions. But Google also has notions of test scope. A unit test necessarily is of smaller scope than an integration test. And the largest-scoped tests (sometimes called end-to-end or system tests) typically involve several real dependencies and fewer test doubles.

如前所述,谷歌对测试规模有特定的概念。小型测试仅限于单线程、单进程、单台机器。较大的测试则没有这些限制。但谷歌也有测试范围的概念。单元测试的范围必然比集成测试的范围小。而范围最大的测试(有时被称为端到端测试或系统测试)通常涉及多个真实依赖项和较少的测试替代。(`Test Double`这一概念由Gerard Meszaros提出,Martin Fowler在其文章[Test Double](https://martinfowler.com/bliki/TestDouble.html)中做了介绍。虽然是2006年的文章,但其中的概念并不过时。文章提到`Test Double`只是一个通用的词,代表为了达到测试目的并减少对被测对象的依赖,使用"替身"代替一个真实的依赖对象,从而保证测试的速度和稳定性。本书统一翻译为"测试替代"。)

Larger tests are many things that small tests are not. They are not bound by the same constraints; thus, they can exhibit the following characteristics:
- They may be slow. Our large tests have a default timeout of 15 minutes or 1 hour, but we also have tests that run for multiple hours or even days.
- They may be nonhermetic. Large tests may share resources with other tests and traffic.
- They may be nondeterministic. If a large test is nonhermetic, it is almost impossible to guarantee determinism: other tests or user state may interfere with it.

较大的测试具有许多小型测试所不具备的特性。它们不受相同的约束,因此可以表现出以下特征:
- 它们可能很慢。我们的大型测试的默认超时时间为15分钟或1小时,但我们也有运行数小时甚至数天的测试。
- 它们可能是不封闭的。大型测试可能与其他测试和流量共享资源。
- 它们可能是不确定的。如果大型测试是不封闭的,就几乎不可能保证其确定性:其他测试或用户状态可能会干扰它。

Unit tests can give you confidence about individual functions, objects, and modules, but large tests provide more confidence that the overall system works as intended.

单元测试可以让你对单个函数、对象和模块建立信心,但大型测试能让你更加确信整个系统是按照预期运行的。

The primary reason larger tests exist is to address *fidelity*. Fidelity is the property by which a test is reflective of the real behavior of the system under test (SUT).

大型测试存在的主要原因是为了解决*仿真度*问题。仿真度是指测试能在多大程度上反映被测系统(SUT)的真实行为。

One way of envisioning fidelity is in terms of the environment. As [Figure 14-1 ](#_bookmark1192)illustrates, unit tests bundle a test and a small portion of code together as a runnable unit, which ensures the code is tested but is very different from how production code runs. Production itself is, naturally, the environment of highest fidelity in testing. There is also a spectrum of interim options. A key for larger tests is to find the proper fit, because increasing fidelity also comes with increasing costs and (in the case of production) increasing risk of failure.

一种理解仿真度的方式是从环境的角度出发。如图14-1所示,单元测试将测试和一小部分代码捆绑在一起作为一个可运行的单元,这确保了代码得到测试,但与生产代码的运行方式有很大不同。生产环境本身自然是测试中仿真度最高的环境。在两者之间还有一系列中间选项。大型测试的关键之一是找到适当的契合点,因为仿真度的提高也伴随着成本的增加,以及(在生产环境的情况下)故障风险的增加。

![Figure 14-1](./images/Figure%2014-1.png)

*Figure 14-1. Scale of increasing fidelity* *图14-1 环境仿真度递增的尺度*

Tests can also be measured in terms of how faithful the test content is to reality. Many handcrafted, large tests are dismissed by engineers if the test data itself looks unrealistic. Test data copied from production is much more faithful to reality (having been captured that way), but a big challenge is how to create realistic test traffic *before* launching the new code. This is particularly a problem in artificial intelligence (AI), for which the “seed” data often suffers from intrinsic bias. And, because most data for unit tests is handcrafted, it covers a narrow range of cases and tends to conform to the biases of the author. The uncovered scenarios missed by the data represent a fidelity gap in the tests.

测试也可以用测试内容对现实的仿真程度来衡量。如果测试数据本身看起来不真实,许多手工编写的大型测试就会被工程师摒弃。从生产环境复制的测试数据对现实的仿真度要高得多(因为它就是以这种方式捕获的),但一个很大的挑战是如何在*发布新代码之前*创建真实的测试流量。这在人工智能(AI)领域尤其是个问题,因为"种子"数据经常带有内在偏差。而且,由于大多数单元测试的数据是手工编写的,它涵盖的案例范围很窄,并倾向于符合作者的偏见。数据所遗漏的未覆盖场景就是测试中的仿真度差距。

### Common Gaps in Unit Tests 单元测试中常见的问题

Larger tests might also be necessary where smaller tests fail. The subsections that follow present some particular areas where unit tests do not provide good risk mitigation coverage.

在小型测试无法胜任的地方,也可能需要大型测试。下面的小节将介绍一些单元测试无法提供良好风险缓解覆盖的特定领域。

#### Unfaithful doubles 仿真度不足的测试替代

A single unit test typically covers one class or module. Test doubles (as discussed in [Chapter 13](#_bookmark1056)) are frequently used to eliminate heavyweight or hard-to-test dependencies. But when those dependencies are replaced, it becomes possible that the replacement and the doubled thing do not agree.

一个单元测试通常覆盖一个类或模块。测试替代(如第13章所讨论的)经常被用来消除重量级或难以测试的依赖项。但是,当这些依赖项被替换后,就有可能出现替代物与被替代物行为不一致的情况。

Almost all unit tests at Google are written by the same engineer who is writing the unit under test. When those unit tests need doubles and when the doubles used are mocks, it is the engineer writing the unit test defining the mock and its intended behavior. But that engineer usually did *not* write the thing being mocked and can be misinformed about its actual behavior. The relationship between the unit under test and a given peer is a behavioral contract, and if the engineer is mistaken about the actual behavior, the understanding of the contract is invalid.

After an SUT is running and traffic is sent to it, we must still verify the behavior. There are a few different ways to do this:
- *Manual*
Much like when you try out your binary locally, manual verification uses humans to interact with an SUT to determine whether it functions correctly. This verification can consist of testing for regressions by performing actions as defined on a consistent test plan or it can be exploratory, working a way through different interaction paths to identify possible new failures.
Note that manual regression testing does not scale sublinearly: the larger a system grows and the more journeys through it there are, the more human time is needed to manually test.
- *Assertions*
Much like with unit tests, these are explicit checks about the intended behavior of the system. For example, for an integration test of Google search of xyzzy, an assertion might be as follows:

```
assertThat(response.Contains("Colossal Cave"))
```
在SUT运行并向其发送流量后,我们仍然必须验证其行为。有几种不同的方法可以做到这一点。
- *手动*
就像你在本地尝试你的二进制文件一样,手动验证使用人工与SUT互动以确定它的功能是否正确。这种验证可以包括通过执行一致的测试计划中定义的操作来测试回归,也可以是探索性的,通过不同的交互路径来识别可能的新故障。
需要注意的是,人工回归测试无法以亚线性的方式扩展:系统越大,其中的操作路径越多,所需的人工测试时间也就越多。
- *断言*
与单元测试一样,这些是对系统预期行为的明确检查。例如,对于谷歌搜索xyzzy的集成测试,一个断言可能如下:

Production-based testing makes it possible to collect a lot of data about user behavior. We have a few different ways to collect metrics on the popularity of and issues with upcoming features, which provides us with an alternative to UAT:
- *Rater evaluation*
Human raters are presented with results for a given operation and choose which one is “better” and why. This feedback is then used to determine whether a given change is positive, neutral, or negative. For example, Google has historically used rater evaluation for search queries (we have published the guidelines we give our raters). In some cases, the feedback from this ratings data can help determine launch go/no-go for algorithm changes. Rater evaluation is critical for nondeterministic systems like machine learning systems for which there is no clear correct answer, only a notion of better or worse.

基于生产环境的测试可以收集大量关于用户行为的数据。我们有几种不同的方法来收集有关即将推出的功能的受欢迎程度和问题的指标,这为我们提供了UAT(用户验收测试)的替代方案:
- *狗粮*
我们可以利用有限的推广和实验,将生产中的功能提供给一部分用户使用。我们有时会和自己的员工一起这样做(吃自己的狗粮),他们会在真实的部署环境中给我们提供宝贵的反馈。