Commit

25章415
qiangmzsx committed Jul 20, 2023
1 parent 52a3d1a commit e47672f
Showing 1 changed file with 6 additions and 6 deletions.
@@ -328,11 +328,11 @@ As mentioned earlier, if anything in the system has the name of the host on whic

The answer is to have an extra layer of indirection; that is, other applications refer to your application by some identifier that is durable across restarts of the specific “backend” instances. That identifier can be resolved by another system that the scheduler writes to when it places your application on a particular machine. Now, to avoid distributed storage lookups on the critical path of making a request to your application, clients will likely look up the address that your app can be found on, and set up a connection, at startup time, and monitor it in the background. This is generally called *service discovery*, and many compute offerings have built-in or modular solutions. Most such solutions also include some form of load balancing, which reduces coupling to specific backends even more.

答案是有一个额外的代理层;也就是说,其他应用程序通过某个标识符来引用你的应用程序,这些标识符在特定的 "后端 "实例的重启中是持久的。这个标识符可以由另一个系统来解决,当调度器把你的应用程序放在一个特定的机器上时,它就会写到这个系统。现在,为了避免在向你的应用程序发出请求的关键路径上进行分布式存储查询,客户可能会在启动时查询你的应用程序的地址,并建立一个连接,并在后台监控它。这通常被称为*服务发现*,许多计算产品有内置或模块化的解决方案。大多数这样的解决方案还包括某种形式的负载平衡,这就进一步减少了与特定后端的耦合。
答案是有一个额外的代理层;也就是说,其他应用程序通过某个标识符来引用你的应用程序,这些标识符在特定的"后端"实例的重启中是持久的。这个标识符可以由另一个系统来解决,当调度器把你的应用程序放在一个特定的机器上时,它就会写到这个系统。现在,为了避免在向你的应用程序发出请求的关键路径上进行分布式存储查询,客户可能会在启动时查询你的应用程序的地址,并建立一个连接,并在后台监控它。这通常被称为*服务发现*,许多计算产品有内置或模块化的解决方案。大多数这样的解决方案还包括某种形式的负载平衡,这就进一步减少了与特定后端的耦合。
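
The indirection described above can be sketched roughly as follows. This is a minimal illustration, not any particular service discovery product: `Registry`, `DiscoveringClient`, the `pizza-orders` identifier, and the addresses are all hypothetical names, and an in-memory dictionary stands in for the system the scheduler writes to.

```python
import threading
import time


class Registry:
    """Hypothetical stand-in for the system the scheduler writes to."""

    def __init__(self):
        self._addresses = {}
        self._lock = threading.Lock()

    def publish(self, service_id, address):
        # Called by the scheduler when it places an instance on a machine.
        with self._lock:
            self._addresses[service_id] = address

    def resolve(self, service_id):
        with self._lock:
            return self._addresses[service_id]


class DiscoveringClient:
    """Resolves a durable identifier at startup, then refreshes it in the background."""

    def __init__(self, registry, service_id, poll_seconds=0.05):
        self._registry = registry
        self._service_id = service_id
        self._poll_seconds = poll_seconds
        self._stop = threading.Event()
        # The lookup happens here, at startup, not on every request.
        self.address = registry.resolve(service_id)
        self._watcher = threading.Thread(target=self._watch, daemon=True)
        self._watcher.start()

    def _watch(self):
        # Event.wait returns False on timeout and True once close() sets the event.
        while not self._stop.wait(self._poll_seconds):
            self.address = self._registry.resolve(self._service_id)

    def close(self):
        self._stop.set()
        self._watcher.join()


registry = Registry()
registry.publish("pizza-orders", "10.0.0.5:8080")
client = DiscoveringClient(registry, "pizza-orders")
address_at_startup = client.address

# The scheduler moves the backend; the client notices without a per-request lookup.
registry.publish("pizza-orders", "10.0.0.9:8080")
time.sleep(0.3)
address_after_move = client.address
client.close()
```

Callers only ever name `pizza-orders`; which machine currently serves it stays a detail of the registry. A real system would likely push updates rather than poll, and would add load balancing across several backends.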

A repercussion of this model is that you will likely need to repeat your requests in some cases, because the server you are talking to might be taken down before it manages to answer.[^14] Retrying requests is standard practice for network communication (e.g., mobile app to a server) because of network issues, but it might be less intuitive for things like a server communicating with its database. This makes it important to design the API of your servers in a way that handles such failures gracefully. For mutating requests, dealing with repeated requests is tricky. The property you want to guarantee is some variant of *idempotency*, meaning that the result of issuing a request twice is the same as issuing it once. One useful tool to help with idempotency is client-assigned identifiers: if you are creating something (e.g., an order to deliver a pizza to a specific address), the order is assigned some identifier by the client; and if an order with that identifier was already recorded, the server assumes it’s a repeated request and reports success (it might also validate that the parameters of the order match).

这种模式的影响是,在某些情况下,你可能需要重复你的请求,因为你对话的服务器可能在响应之前就被关闭了。由于网络问题,重试请求是网络通信的标准做法(例如,移动应用程序到服务器),但对于像服务器与数据库通信的事情来说,这可能不够直接。这使得在设计你的服务器的API时,必须能够优雅地处理这种故障。对于突变的请求,处理重复请求是很棘手的。你想保证的属性是*幂等性变体*--发出一个请求两次的结果与发出一次相同。帮助实现幂等性的一个有用工具是客户机指定的标识符:如果你正在创建一些东西(例如,将比萨饼送到一个特定的地址的订单),该订单由客户端分配一些标识符;如果一个具有该标识符的订单已经被记录下来,服务器会认为这是一个重复的请求并报告成功(它也可能验证该订单的参数是否匹配)。
这种模式的影响是,在某些情况下,你可能需要重复你的请求,因为你对话的服务器可能在响应之前就被关闭了。由于网络问题,重试请求是网络通信的标准做法(例如,移动应用程序到服务器),但对于像服务器与数据库通信的事情来说,这可能不够直接。这使得在设计你的服务器的API时,必须能够优雅地处理这种故障。对于突变的请求,处理重复请求是很棘手的。你想保证的属性是*幂等性变体*--发出一个请求两次的结果与发出一次相同。帮助实现幂等性的一个有用工具是客户机指定的标识符:如果你正在创建一些内容(例如,将比萨饼送到一个特定的地址的订单),该订单由客户端分配一些标识符;如果一个具有该标识符的订单已经被记录下来,服务器会认为这是一个重复的请求并报告成功(它也可能验证该订单的参数是否匹配)。
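
A minimal sketch of the client-assigned identifier idea, using the pizza order from the paragraph above. The `OrderServer` class, its method names, and the `OrderConflict` exception are hypothetical, and an in-memory dictionary stands in for durable storage.

```python
import uuid


class OrderConflict(Exception):
    """Raised when a known order ID is reused with different parameters."""


class OrderServer:
    def __init__(self):
        self._orders = {}  # order_id -> order parameters

    def create_order(self, order_id, address, pizza):
        params = {"address": address, "pizza": pizza}
        existing = self._orders.get(order_id)
        if existing is not None:
            # Same ID seen before: treat it as a retry, but check the parameters match.
            if existing != params:
                raise OrderConflict(order_id)
            return "ok"
        self._orders[order_id] = params
        return "ok"

    def order_count(self):
        return len(self._orders)


server = OrderServer()
order_id = str(uuid.uuid4())  # chosen by the client, reused on every retry

first = server.create_order(order_id, "1 Main St", "margherita")
retry = server.create_order(order_id, "1 Main St", "margherita")
```

Both calls report success, yet only one order is recorded, so the client can retry safely when it does not know whether the first attempt reached the server.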

One more surprising thing that we saw happen is that sometimes the scheduler loses contact with a particular machine due to some network problem. It then decides that all of the work there is lost and reschedules it onto other machines—and then the machine comes back! Now we have two programs on two different machines, both thinking they are “replica072.” The way for them to disambiguate is to check which one of them is referred to by the address resolution system (and the other one should terminate itself or be terminated); but it also is one more case for idempotency: two replicas performing the same work and serving the same role are another potential source of request duplication.
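
The disambiguation check could look something like the sketch below. It assumes each replica knows its own address and can query the address resolution system; the function name, the lookup table, and the machine addresses are hypothetical.

```python
def should_keep_running(my_address, replica_name, resolve):
    """Return True only if the resolution system maps this name to our address."""
    return resolve(replica_name) == my_address


# Hypothetical state after the scheduler rescheduled replica072 onto machine-b.
resolution_table = {"replica072": "machine-b:9000"}

old_copy_keeps_running = should_keep_running(
    "machine-a:9000", "replica072", resolution_table.get
)
new_copy_keeps_running = should_keep_running(
    "machine-b:9000", "replica072", resolution_table.get
)
```

The copy on the machine that came back sees that the name no longer resolves to it and should terminate. Until it does, both copies may process the same work, which is why idempotent requests still matter here.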

@@ -350,15 +350,15 @@ Most of the previous discussion focused on production-quality jobs, either those

Often, the engineer’s workstation is a satisfactory solution to the need for compute resources. If one wants to, say, automate the skimming through the 1 GB of logs that a service produced over the last day to check whether a suspicious line A always occurs before the error line B, they can just download the logs, write a short Python script, and let it run for a minute or two.

通常,工程师工作站是满足计算资源需求的满意解决方案。比如说,如果想自动浏览服务在最后一天生成的1GB日志,以检查可疑行a是否总是出现在错误行B之前,他们可以下载日志,编写一个简短的Python脚本,然后让它运行一两分钟
通常,工程师的工作站是满足计算资源需求的满意解决方案。比如说,如果想自动浏览服务在最后一天生成的1GB日志,以检查可疑A行是否总是出现在错误B行之前,他们可以下载日志,编写一个简短的Python脚本,然后让它运行一两分钟即可
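
A short script of the kind described might look like this. It is only a sketch under one assumed reading of "always occurs before": every line containing B must be preceded by a line containing A since the previous B. The marker strings and sample lines are made up for illustration.

```python
def a_always_precedes_b(lines, marker_a, marker_b):
    """Check that each B line has an A line since the previous B (or the start)."""
    seen_a = False
    for line in lines:
        if marker_a in line:
            seen_a = True
        elif marker_b in line:
            if not seen_a:
                return False
            seen_a = False  # require a fresh A before the next B
    return True


sample_ok = [
    "INFO start",
    "WARN cache miss",
    "ERROR timeout",
    "WARN cache miss",
    "INFO retry",
    "ERROR timeout",
]
sample_bad = ["WARN cache miss", "ERROR timeout", "ERROR timeout"]

ok_result = a_always_precedes_b(sample_ok, "WARN cache miss", "ERROR timeout")
bad_result = a_always_precedes_b(sample_bad, "WARN cache miss", "ERROR timeout")
```

Because the function accepts any iterable of lines, passing an open file object streams a downloaded log line by line without loading all of it into memory.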

But if they want to automate the skimming through 1 TB of logs that service produced over the last year (for a similar purpose), waiting for roughly a day for the results to come in is likely not acceptable. A compute service that allows the engineer to just run the analysis on a distributed environment in several minutes (utilizing a few hundred cores) means the difference between having the analysis now and having it tomorrow. For tasks that require iteration—for example, if I will need to refine the query after seeing the results—the difference may be between having it done in a day and not having it done at all.

但是,如果他们想自动浏览去年服务生产的1 TB日志(出于类似目的),等待大约一天的结果可能是不可接受的。一个允许工程师在几分钟内(利用几百个内核)在分布式环境中运行分析的计算服务意味着现在进行分析和明天进行分析的区别。例如,对于需要迭代的任务,如果我在看到结果后需要优化查询,那么在一天内完成查询和根本不完成查询之间可能存在差异
但是,如果他们想自动浏览去年服务生产的1 TB日志(出于类似目的),等待大约一天的结果可能是不可接受的。一个允许工程师在几分钟内(利用几百个内核)在分布式环境中运行分析的计算服务意味着立即得到分析结果和明天才能得到结果之间的区别。例如,对于需要迭代的任务,如果我在看到结果后需要优化查询,那么在一天内完成任务和根本无法完成任务之间可能存在差异
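
The "roughly a day" versus "several minutes" estimate follows from simple scaling. The figures below are assumptions chosen to match the text's rough numbers (1.5 minutes per GB as the midpoint of "a minute or two", 300 cores for "a few hundred"), with linear scaling and perfect parallelism assumed.

```python
minutes_per_gb = 1.5   # assumed midpoint of "a minute or two" for 1 GB
gigabytes = 1000       # 1 TB, using decimal units
cores = 300            # assumed value for "a few hundred cores"

# One workstation processing the whole terabyte sequentially.
single_machine_hours = minutes_per_gb * gigabytes / 60

# The same work split evenly across the cores, ignoring coordination overhead.
distributed_minutes = minutes_per_gb * gigabytes / cores
```

Under these assumptions the single-machine run takes about 25 hours and the distributed run about 5 minutes. Real jobs carry scheduling and data-movement overhead, so the actual gap would be somewhat smaller.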

One concern that arises at times with this approach is that allowing engineers to just run one-off jobs on the distributed environment risks them wasting resources. This is, of course, a trade-off, but one that should be made consciously. It’s very unlikely that the cost of processing that the engineer runs is going to be more expensive than the engineer’s time spent on writing the processing code. The exact trade-off values differ depending on an organization’s compute environment and how much it pays its engineers, but it’s unlikely that a thousand core hours costs anything close to a day of engineering work. Compute resources, in that respect, are similar to markers, which we discussed in the opening of the book; there is a small savings opportunity for the company in instituting a process to acquire more compute resources, but this process is likely to cost much more in lost engineering opportunity and time than it saves.

这种方法有时会引起一个问题,即允许工程师在分布式环境中运行一次性作业可能会浪费资源。当然,这是一种权衡,但应该有意识地进行权衡。工程师运行的处理成本很可能不会比工程师写处理代码的时间更贵。确切的权衡取决于一个组织的计算环境和它付给工程师的工资多少,但一千个核心小时的成本不太可能接近一天的工程工作。在这方面,计算资源类似于标记,我们在本书的开篇中讨论过;对于公司来说,建立一个获取更多计算资源的过程是一个很小的节约机会,但是这个过程在失去工程机会和时间方面的成本可能比它节省的成本高得多。
这种方法有时会引起一个问题,即允许工程师在分布式环境中运行一次性作业可能会浪费资源。当然,这是一种权衡,但应该有意识地进行权衡。工程师运行的处理成本很可能不会比工程师写处理代码的时间更贵。确切的权衡取决于一个组织的计算环境和它付给工程师的工资多少,但一千个核心小时的成本不太可能接近一天的工程工作量。在这方面,计算资源类似于标记,我们在本书的开篇中讨论过;对于公司来说,建立一个获取更多计算资源的过程是一个很小的节约机会,但是这个过程在失去工程机会和时间方面的成本可能比它节省的成本高得多。

That said, compute resources differ from markers in that it’s easy to take way too many by accident. Although it’s unlikely someone will carry off a thousand markers, it’s totally possible someone will accidentally write a program that occupies a thousand machines without noticing.[^15] The natural solution to this is instituting quotas for resource usage by individual engineers. An alternative used by Google is to observe that because we’re running low-priority batch workloads effectively for free (see the section on multitenancy later on), we can provide engineers with almost unlimited quota for low-priority batch, which is good enough for most one-off engineering tasks.

@@ -390,7 +390,7 @@ Let’s discuss two examples of how a containerized abstraction allows an organi

A *filesystem abstraction* provides a way to incorporate software that was not written in the company without the need to manage custom machine configurations. This might be open source software an organization runs in its datacenter, or acquisitions that it wants to onboard onto its CaaS. Without a filesystem abstraction, onboarding a binary that expects a different filesystem layout (e.g., expecting a helper binary at */bin/foo/bar*) would require either modifying the base layout of all machines in the fleet, or fragmenting the fleet, or modifying the software (which might be difficult, or even impossible due to licence considerations).

*文件系统抽象*提供了一种方法,可以将不是在公司编写的软件纳入其中,而不需要管理自定义的机器配置。这可能是某个组织在其数据中心运行的开源软件,或者是它想在其CaaS上进行的整合。在没有文件系统抽象的情况下,如果一个二进制文件需要一个不同的文件系统布局(例如,期望在*/bin/foo/bar*有一个附加二进制文件)将需要修改机群中所有机器的基本布局,或者对集群进行分段操作,或者修改软件(这可能很困难,甚至由于许可证的考虑而不可能)。
*文件系统抽象*提供了一种方法,将不是由公司编写的软件整合进来的方式,而无需管理自定义的机器配置。这可能是某个组织在其数据中心运行的开源软件,或者是它想在其CaaS上进行的整合。在没有文件系统抽象的情况下,如果一个二进制文件需要一个不同的文件系统布局(例如,期望在*/bin/foo/bar*有一个附加二进制文件)将需要修改机群中所有机器的基本布局,或者对集群进行分段操作,或者修改软件(这可能会由于许可证考虑而变得困难甚至不可能)。

Even though these solutions might be feasible if importing an external piece of software is something that happens once in a lifetime, it is not a sustainable solution if importing software becomes a common (or even only-somewhat-rare) practice.
