-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ORCA specification #105
base: main
Are you sure you want to change the base?
Changes from all commits
d32a24c
5c850dc
a8a52e8
b0e239e
4fb5a93
761dade
e614ed3
5b4b21e
1ebccfb
3cd3f54
87ea672
3d8efda
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,278 @@ | ||
ORCA Specification | ||
==== | ||
|
||
Last Updated: Oct 24, 2024 | ||
|
||
Harvey Tuch, Mark Roth, Misha Efimov, Andres Guedez | ||
|
||
*NOTE: this is a continously updated ORCA specification based on Harvey's [original | ||
specification](https://docs.google.com/document/d/1NSnK3346BkBo1JUU3I9I5NYYnaJZQPt8_Z_XCBCI3uA/edit?tab=t.0#heading=h.xgjl2srtytjt).* | ||
|
||
# Overview | ||
|
||
Simple load balancing decisions can be made by taking into account local or global | ||
knowledge of a backend’s load, for example CPU. More sophisticated load balancing decisions are | ||
possible with application specific knowledge, e.g. internal queue depths, or by combining multiple | ||
application and/or hardware resource metrics. | ||
|
||
This is useful for services that may be resource constrained along multiple dimensions (e.g. both | ||
CPU and memory may become bottlenecks, depending on the applied load and execution environment, it’s | ||
not possible to tell which upfront) and where these dimensions do not slot within predefined | ||
categories (e.g. the resource may be “number of free threads in a pool”, disk IOPS, etc.). | ||
|
||
This document provides a specification for an open standard for request cost aggregation and | ||
reporting by backends and the corresponding aggregation of such reports by L7 load balancers (such | ||
as Envoy) on the data plane. Within scope is the wire format, backend report collection mechanism | ||
and aggregation at query cost delivery time. Outside of the scope of this document are the LB | ||
algorithms for working with aggregated data and the mechanics of how new balancing decisions are | ||
delivered to data plane load balancers. | ||
|
||
# Background | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this description is a little too xDS-centric: it describes ORCA solely in the context of ORCA-to-LRS propagation, even though ORCA can be used without using xDS at all. I would like to see this section re-written to explain how ORCA can be used to share data between endpoints and clients for things like client-side WRR. We can then add a section at the end to show how it can be used in conjunction with xDS by propagating ORCA data into LRS load reports. |
||
|
||
Data plane load balancers (DPLBs), for example an Envoy proxy or gRPC-LB client, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. s/gRPC-LB/gRPC/ |
||
receive HTTP or gRPC requests from clients, make LB decisions and proxy the request and | ||
corresponding response to/from backends. These load balancing (LB) decisions are driven by LB | ||
assignments delivered by the control plane load balancer (CPLB) via EDS in the Universal Data Plane | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. UDPA was an alternate name that we were pursuing for xDS for a while, but it's been dropped. I think we should just remove the "in the Universal Data Plane API (UDPA)" part -- it's clear enough what EDS is. |
||
API (UDPA). | ||
|
||
The control plane can provide globally optimal decisions via greater visibility than possible in any | ||
single DPLB instance. DPLB’s share their load information with the control plane via LRS and in | ||
return receive EDS assignments. It is the role of the CPLB to make intelligent load balancing | ||
decisions as part of this control system. | ||
|
||
Inputs to the CPLB decision making may include LRS load reports and known capacities for localities | ||
and/or endpoints. This may be, in the simplest case, a metric such as QPS. The DPLB reports total | ||
queries at regular time intervals via LRS, with aggregate counts attributed at locality or endpoint | ||
granularity. | ||
|
||
When capacity is defined with units such as CPU, memory or application-specific criteria, the DPLB | ||
needs to have utilization or cost reported by either the endpoint or via the runtime infrastructure | ||
that the backend is executing on. | ||
|
||
We assume a model below where the CPLB requires from the data plane load balancer: Fixed, periodic, | ||
load report intervals. Total queries binned by source × destination locality delivered every (1). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is supposed to be an enumerated list, but the markdown formatting is wrong. |
||
Aggregate cost (in the same units as capacity) for the queries included in (2) by source × | ||
destination locality. | ||
|
||
Backends providing reports are typically VMs or containers. Since these are implemented and deployed | ||
independent of the DPLB, an open protocol for load reporting from backend to DPLB is desirable. We | ||
refer to this reporting standard and OSS reference patterns and implementations as Open Request Cost | ||
Aggregation (ORCA). | ||
|
||
# Specification | ||
|
||
## ORCA Load Report Format | ||
|
||
We propose a standard format to avoid DPLB operators needing to rewrite their backends for any given | ||
DPLB. | ||
|
||
ORCA reports are essentially key-value pairs, with the capacity name as key (e.g. cpu, memory) and | ||
the value a numerical quantity representing either the absolute cost (e.g. 123 bytes) or | ||
instantaneous sample of utilization at request time (e.g. 33.58%). CPLBs do not need to consider the | ||
semantics of the key, the LB algorithm matches opaque keys in reports with comparable named | ||
capacities in the CPLB capacity configuration. With that in mind, there is a strong case to be made | ||
for standardizing common capacities, since the ORCA reports may also be used by other tools (e.g. | ||
logging, monitoring, debug) that will benefit from a zero-config approach. | ||
|
||
We propose the following key/value pairs be the standard core metrics in ORCA: | ||
|
||
* application\_utilization: Application-specific utilization of backend | ||
* cpu\_utilization: CPU | ||
utilization of backend | ||
* mem\_utilization: Memory utilization of backend | ||
|
||
None of the core ORCA metrics are compulsory, but when reported by a backend they must follow the | ||
above format. Note that there is considerable complexity in providing these metrics for either ORCA, | ||
libraries such as OpenTelemetry or individual DPLB operators who roll their own solutions, when | ||
considering the matrix of platforms, languages and virtualization/container environments. CPU and | ||
memory are, however, application agnostic and generally useful. | ||
|
||
Applications may arbitrarily define additional metrics as keys and units for their corresponding | ||
values. | ||
|
||
ORCA reports are ultimately aggregated into ClusterStats messages in LRS reports from the L7 load | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This sentence doesn't belong in this section. This is describing one possible use of ORCA, whereas this section is about the ORCA load report format. |
||
balancer to the corresponding management server. | ||
|
||
Without prescribing wire format, the following protobuf schema provides the data model for ORCA | ||
reports: | ||
|
||
See https://github.com/cncf/xds/blob/main/xds/data/orca/v3/orca_load_report.proto for the source | ||
of truth. | ||
|
||
``` | ||
message OrcaLoadReport { | ||
// CPU utilization expressed as a fraction of available CPU resources. This | ||
// should be derived from the latest sample or measurement. The value may be | ||
// larger than 1.0 when the usage exceeds the reporter dependent notion of | ||
// soft limits. | ||
double cpu_utilization = 1 [(validate.rules).double.gte = 0]; | ||
|
||
// Memory utilization expressed as a fraction of available memory | ||
// resources. This should be derived from the latest sample or measurement. | ||
double mem_utilization = 2 [(validate.rules).double.gte = 0, (validate.rules).double.lte = 1]; | ||
|
||
// Total RPS being served by an endpoint. This should cover all services that an endpoint is | ||
// responsible for. | ||
// Deprecated -- use ``rps_fractional`` field instead. | ||
uint64 rps = 3 [deprecated = true]; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suggest that we don't list the deprecated fields in this doc. They exist in the protobuf, but if this doc is aimed at showing people how to use ORCA today, we should focus on the non-deprecated fields only. |
||
|
||
// Application specific requests costs. Each value is an absolute cost (e.g. 3487 bytes of | ||
// storage) associated with the request. | ||
map<string, double> request_cost = 4; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we need to provide guidance on when to use each of the three maps here (request_cost, utilization, and named_metrics). |
||
|
||
// Resource utilization values. Each value is expressed as a fraction of total resources | ||
// available, derived from the latest sample or measurement. | ||
map<string, double> utilization = 5 | ||
[(validate.rules).map.values.double.gte = 0, (validate.rules).map.values.double.lte = 1]; | ||
|
||
// Total RPS being served by an endpoint. This should cover all services that an endpoint is | ||
// responsible for. | ||
double rps_fractional = 6 [(validate.rules).double.gte = 0]; | ||
|
||
// Total EPS (errors/second) being served by an endpoint. This should cover | ||
// all services that an endpoint is responsible for. | ||
double eps = 7 [(validate.rules).double.gte = 0]; | ||
|
||
// Application specific opaque metrics. | ||
map<string, double> named_metrics = 8; | ||
|
||
// Application specific utilization expressed as a fraction of available | ||
// resources. For example, an application may report the max of CPU and memory | ||
// utilization for better load balancing if it is both CPU and memory bound. | ||
// This should be derived from the latest sample or measurement. | ||
// The value may be larger than 1.0 when the usage exceeds the reporter | ||
// dependent notion of soft limits. | ||
double application_utilization = 9 [(validate.rules).double.gte = 0]; | ||
} | ||
``` | ||
|
||
# Reporting | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this sub-section and everything underneath it should actually be under the "Specification" section, which means that a bunch of these headers need one more |
||
|
||
## Inline Per-Request Format | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Suggest adding the following right before this sub-section: """ There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we need to provide some guidance about when it's useful to choose per-request vs. OOB reporting. I think we should recommend that the vast majority of applications use per-request reporting. OOB reporting should be considered only for traffic patterns that do not result in per-request reports being delivered often enough to stay fresh, such as a client with very sparse traffic or an application whose traffic is mostly long-running streams. We should also point out that OOB reporting is good for utilization but doesn't make much sense for query cost reporting. |
||
|
||
With inline (per-request) reporting, load reports will be | ||
attached to the response headers or trailers of an HTTP/gRPC stream. | ||
|
||
The name of the header is `endpoint-load-metrics` and could include load reports in different formats, | ||
which are determined by the prefix word of the header value. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think we should say something here about how different DPLBs may support different subsets of these formats, so endpoints may need to be aware of which DPLBs they are sending to when choosing a format. |
||
|
||
### Binary Protobuf | ||
|
||
For Protocol Buffers aware code, this will be a binary serialized base64 encoded OrcaLoadReport | ||
protobuf in `endpoint-load-metrics` header with BIN prefix: | ||
|
||
`endpoint-load-metrics: BIN CZqZmZmZmbk/MQAAAAAAAABAQg4KA2ZvbxGamZmZmZm5P0IOCgNiYXIRmpmZmZmZyT8=` | ||
|
||
It could also be specified in the `endpoint-load-metrics-bin` header without BIN prefix: | ||
|
||
`endpoint-load-metrics-bin: CZqZmZmZmbk/MQAAAAAAAABAQg4KA2ZvbxGamZmZmZm5P0IOCgNiYXIRmpmZmZmZyT8=` | ||
|
||
### Native HTTP | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this header should say "Text format". It's not clear what this has to do with HTTP specifically. |
||
|
||
Comma separated key-value pairs in `endpoint-load-metrics`. This is a flattened | ||
representation of OrcaLoadReport, with the map fields elided into the top level scope by prepending | ||
the ‘<map_name>.’: | ||
|
||
`endpoint-load-metrics: TEXT cpu_utilization=0.3, mem_utilization=0.8, rps_fractional=10.0, eps=1, | ||
named_metrics.custom_metric_util=0.4` | ||
|
||
### JSON | ||
|
||
The JSON encoding of OrcaLoadReport: | ||
|
||
`endpoint-load-metrics: JSON {“cpu_utilization”: 0.3, “mem_utilization”: 0.8, “rps_fractional”: 10.0, | ||
“eps”: 1, “named_metrics”: {“custom-metric-util”: 0.4}}` | ||
|
||
## Out-of-Band (OOB) | ||
|
||
OOB reporting involves an additional load reporting agent that does not sit in the request path. | ||
This is periodically sampled with sufficient frequency to provide temporal association with | ||
requests. The data plane LB may either initiate a connection to the agent or offer a gRPC service | ||
for agents to report to: | ||
|
||
``` | ||
// Out-of-band (OOB) load reporting service for the additional load reporting | ||
// agent that does not sit in the request path. Reports are periodically sampled | ||
// with sufficient frequency to provide temporal association with requests. | ||
// OOB reporting compensates the limitation of in-band reporting in revealing | ||
// costs for backends that do not provide a steady stream of telemetry such as | ||
// long running stream operations and zero QPS services. This is a server | ||
// streaming service, client needs to terminate current RPC and initiate | ||
// a new call to change backend reporting frequency. | ||
service OpenRcaService { | ||
rpc StreamCoreMetrics(OrcaLoadReportRequest) returns (stream xds.data.orca.v3.OrcaLoadReport); | ||
} | ||
|
||
message OrcaLoadReportRequest { | ||
// Interval for generating Open RCA core metric responses. | ||
google.protobuf.Duration report_interval = 1; | ||
// Request costs to collect. If this is empty, all known requests costs tracked by | ||
// the load reporting agent will be returned. This provides an opportunity for | ||
// the client to selectively obtain a subset of tracked costs. | ||
repeated string request_cost_names = 2; | ||
} | ||
``` | ||
|
||
The definition for this service can be found at | ||
https://github.com/cncf/xds/blob/main/xds/service/orca/v3/orca.proto. | ||
|
||
Alternatively, an HTTP endpoint can be exposed by the load reporting server. A GET or HEAD request | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This paragraph seems to describe a possible future improvement, not something that's part of the existing ORCA spec. I suggest either removing it or moving it to a "Possible future improvements" section. |
||
to this endpoint will return a response with the `endpoint-load-metrics` header that contains the | ||
load report in one of the formats listed above. | ||
|
||
# ORCA Integration Considerations | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this section also needs to be updated, since it no longer seems that relevant as currently written. In general, I think users are not going to choose a sidecar or library model based solely on ORCA; instead, they will probably have one of those models already and will want to know what the options are for adding ORCA support. I think what we should be explaining here is that there are cases where a smart data plane like Envoy or gRPC can handle ORCA for you, transparently to the application, and there are other cases where the application will need to generate the ORCA data directly. |
||
|
||
## Sidecar | ||
|
||
A sidecar, e.g. Envoy or some other in-container proxy process, can inject the inline ORCA header | ||
at response completion. | ||
|
||
![Sidecar Integration Diagram](./orca-spec-sidecar.jpg) | ||
|
||
The advantage of a sidecar model is that the application requires no modification as in the example | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this is true only if the sidecar is capable of determining the metrics that the endpoint wishes to report to its clients. For example, this would probably work for things like CPU or memory utilization, which the sidecar can determine in a fairly generic way. But it won't work for application-specific things, like custom utilization metrics or query cost metrics -- those will always require cooperation on the part of the application. |
||
service A above. A disadvantage of this approach is that it requires changes to how DPLB operators | ||
operationally manage their backends, introducing a binary that will require upgrades and version | ||
management. This might not be a concern in a service mesh architecture where a sidecar can be taken | ||
as given. | ||
|
||
Service B above links with an observability library such as OpenTelemetry that provides support for | ||
custom metrics. This enables the sidecar to collect additional telemetry data for the ORCA reports. | ||
|
||
## Library | ||
|
||
The application will link in a library for ORCA that will assist with generating the inline response | ||
header. | ||
|
||
![Library Integration Diagram](./orca-spec-library.jpg) | ||
|
||
This will require DPLB operators to modify their applications to call the ORCA library and inject | ||
the results into the HTTP response. | ||
|
||
The library will provide an interface such as: | ||
|
||
`std::string GetLoadReportHeaderValue(const std::unordered_map<std::string, double>& named_metrics);` | ||
|
||
This method will combine the core metrics such as CPU and Memory (computed by the library) and, | ||
together with the extra application-specific metrics, compute the inline ORCA response header. The | ||
application will then be responsible for including this in its response. | ||
|
||
# Known DPLB Implementations | ||
|
||
## gRPC | ||
TODO: [Repository]() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The repo differs by language: |
||
|
||
[Design](https://github.com/grpc/proposal/blob/master/A51-custom-backend-metrics.md) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It may also be useful to refer to the following designs as well: |
||
|
||
## Envoy | ||
|
||
[Repository](https://github.com/envoyproxy/envoy/tree/main/source/common/orca) | ||
|
||
[Design](http://docs/document/d/1gb_2pcNnEzTgo1EJ6w1Ol7O-EH-O_Ysu5o215N9MTAg?tab=t.0#heading=h.do9yfa1wlpk8) | ||
|
||
# Change History | ||
|
||
|Date|Description| | ||
|----|-----------| | ||
|Oct 25, 2024|Updated specification in this document.| | ||
|April 16, 2019|Original [specification](https://docs.google.com/document/d/1NSnK3346BkBo1JUU3I9I5NYYnaJZQPt8_Z_XCBCI3uA/edit?tab=t.0) proposed to the Envoy community.| | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move all of this to a top-level "doc" directory (i.e., instead of
orca/
, it should bedoc/orca/
).