Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ORCA specification #105

Open
wants to merge 12 commits into
base: main
Choose a base branch
from
Binary file added orca/orca-spec-library.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added orca/orca-spec-sidecar.jpg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
278 changes: 278 additions & 0 deletions orca/orca-spec.md
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move all of this to a top-level "doc" directory (i.e., instead of orca/, it should be doc/orca/).

Original file line number Diff line number Diff line change
@@ -0,0 +1,278 @@
ORCA Specification
====

Last Updated: Oct 24, 2024

Harvey Tuch, Mark Roth, Misha Efimov, Andres Guedez

*NOTE: this is a continously updated ORCA specification based on Harvey's [original
specification](https://docs.google.com/document/d/1NSnK3346BkBo1JUU3I9I5NYYnaJZQPt8_Z_XCBCI3uA/edit?tab=t.0#heading=h.xgjl2srtytjt).*

# Overview

Simple load balancing decisions can be made by taking into account local or global
knowledge of a backend’s load, for example CPU. More sophisticated load balancing decisions are
possible with application specific knowledge, e.g. internal queue depths, or by combining multiple
application and/or hardware resource metrics.

This is useful for services that may be resource constrained along multiple dimensions (e.g. both
CPU and memory may become bottlenecks, depending on the applied load and execution environment, it’s
not possible to tell which upfront) and where these dimensions do not slot within predefined
categories (e.g. the resource may be “number of free threads in a pool”, disk IOPS, etc.).

This document provides a specification for an open standard for request cost aggregation and
reporting by backends and the corresponding aggregation of such reports by L7 load balancers (such
as Envoy) on the data plane. Within scope is the wire format, backend report collection mechanism
and aggregation at query cost delivery time. Outside of the scope of this document are the LB
algorithms for working with aggregated data and the mechanics of how new balancing decisions are
delivered to data plane load balancers.

# Background
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this description is a little too xDS-centric: it describes ORCA solely in the context of ORCA-to-LRS propagation, even though ORCA can be used without using xDS at all.

I would like to see this section re-written to explain how ORCA can be used to share data between endpoints and clients for things like client-side WRR. We can then add a section at the end to show how it can be used in conjunction with xDS by propagating ORCA data into LRS load reports.


Data plane load balancers (DPLBs), for example an Envoy proxy or gRPC-LB client,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/gRPC-LB/gRPC/

receive HTTP or gRPC requests from clients, make LB decisions and proxy the request and
corresponding response to/from backends. These load balancing (LB) decisions are driven by LB
assignments delivered by the control plane load balancer (CPLB) via EDS in the Universal Data Plane
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UDPA was an alternate name that we were pursuing for xDS for a while, but it's been dropped. I think we should just remove the "in the Universal Data Plane API (UDPA)" part -- it's clear enough what EDS is.

API (UDPA).

The control plane can provide globally optimal decisions via greater visibility than possible in any
single DPLB instance. DPLB’s share their load information with the control plane via LRS and in
return receive EDS assignments. It is the role of the CPLB to make intelligent load balancing
decisions as part of this control system.

Inputs to the CPLB decision making may include LRS load reports and known capacities for localities
and/or endpoints. This may be, in the simplest case, a metric such as QPS. The DPLB reports total
queries at regular time intervals via LRS, with aggregate counts attributed at locality or endpoint
granularity.

When capacity is defined with units such as CPU, memory or application-specific criteria, the DPLB
needs to have utilization or cost reported by either the endpoint or via the runtime infrastructure
that the backend is executing on.

We assume a model below where the CPLB requires from the data plane load balancer: Fixed, periodic,
load report intervals. Total queries binned by source × destination locality delivered every (1).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to be an enumerated list, but the markdown formatting is wrong.

Aggregate cost (in the same units as capacity) for the queries included in (2) by source ×
destination locality.

Backends providing reports are typically VMs or containers. Since these are implemented and deployed
independent of the DPLB, an open protocol for load reporting from backend to DPLB is desirable. We
refer to this reporting standard and OSS reference patterns and implementations as Open Request Cost
Aggregation (ORCA).

# Specification

## ORCA Load Report Format

We propose a standard format to avoid DPLB operators needing to rewrite their backends for any given
DPLB.

ORCA reports are essentially key-value pairs, with the capacity name as key (e.g. cpu, memory) and
the value a numerical quantity representing either the absolute cost (e.g. 123 bytes) or
instantaneous sample of utilization at request time (e.g. 33.58%). CPLBs do not need to consider the
semantics of the key, the LB algorithm matches opaque keys in reports with comparable named
capacities in the CPLB capacity configuration. With that in mind, there is a strong case to be made
for standardizing common capacities, since the ORCA reports may also be used by other tools (e.g.
logging, monitoring, debug) that will benefit from a zero-config approach.

We propose the following key/value pairs be the standard core metrics in ORCA:

* application\_utilization: Application-specific utilization of backend
* cpu\_utilization: CPU
utilization of backend
* mem\_utilization: Memory utilization of backend

None of the core ORCA metrics are compulsory, but when reported by a backend they must follow the
above format. Note that there is considerable complexity in providing these metrics for either ORCA,
libraries such as OpenTelemetry or individual DPLB operators who roll their own solutions, when
considering the matrix of platforms, languages and virtualization/container environments. CPU and
memory are, however, application agnostic and generally useful.

Applications may arbitrarily define additional metrics as keys and units for their corresponding
values.

ORCA reports are ultimately aggregated into ClusterStats messages in LRS reports from the L7 load
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence doesn't belong in this section. This is describing one possible use of ORCA, whereas this section is about the ORCA load report format.

balancer to the corresponding management server.

Without prescribing wire format, the following protobuf schema provides the data model for ORCA
reports:

See https://github.com/cncf/xds/blob/main/xds/data/orca/v3/orca_load_report.proto for the source
of truth.

```
message OrcaLoadReport {
// CPU utilization expressed as a fraction of available CPU resources. This
// should be derived from the latest sample or measurement. The value may be
// larger than 1.0 when the usage exceeds the reporter dependent notion of
// soft limits.
double cpu_utilization = 1 [(validate.rules).double.gte = 0];

// Memory utilization expressed as a fraction of available memory
// resources. This should be derived from the latest sample or measurement.
double mem_utilization = 2 [(validate.rules).double.gte = 0, (validate.rules).double.lte = 1];

// Total RPS being served by an endpoint. This should cover all services that an endpoint is
// responsible for.
// Deprecated -- use ``rps_fractional`` field instead.
uint64 rps = 3 [deprecated = true];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that we don't list the deprecated fields in this doc. They exist in the protobuf, but if this doc is aimed at showing people how to use ORCA today, we should focus on the non-deprecated fields only.


// Application specific requests costs. Each value is an absolute cost (e.g. 3487 bytes of
// storage) associated with the request.
map<string, double> request_cost = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to provide guidance on when to use each of the three maps here (request_cost, utilization, and named_metrics).


// Resource utilization values. Each value is expressed as a fraction of total resources
// available, derived from the latest sample or measurement.
map<string, double> utilization = 5
[(validate.rules).map.values.double.gte = 0, (validate.rules).map.values.double.lte = 1];

// Total RPS being served by an endpoint. This should cover all services that an endpoint is
// responsible for.
double rps_fractional = 6 [(validate.rules).double.gte = 0];

// Total EPS (errors/second) being served by an endpoint. This should cover
// all services that an endpoint is responsible for.
double eps = 7 [(validate.rules).double.gte = 0];

// Application specific opaque metrics.
map<string, double> named_metrics = 8;

// Application specific utilization expressed as a fraction of available
// resources. For example, an application may report the max of CPU and memory
// utilization for better load balancing if it is both CPU and memory bound.
// This should be derived from the latest sample or measurement.
// The value may be larger than 1.0 when the usage exceeds the reporter
// dependent notion of soft limits.
double application_utilization = 9 [(validate.rules).double.gte = 0];
}
```

# Reporting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this sub-section and everything underneath it should actually be under the "Specification" section, which means that a bunch of these headers need one more # character.


## Inline Per-Request Format
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest adding the following right before this sub-section:

"""
There are two types of load reporting possible for the ORCA data model, per-request and out-of-band (OOB). We address the wire format below with respect to each of these reporting types.
"""

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to provide some guidance about when it's useful to choose per-request vs. OOB reporting.

I think we should recommend that the vast majority of applications use per-request reporting. OOB reporting should be considered only for traffic patterns that do not result in per-request reports being delivered often enough to stay fresh, such as a client with very sparse traffic or an application whose traffic is mostly long-running streams.

We should also point out that OOB reporting is good for utilization but doesn't make much sense for query cost reporting.


With inline (per-request) reporting, load reports will be
attached to the response headers or trailers of an HTTP/gRPC stream.

The name of the header is `endpoint-load-metrics` and could include load reports in different formats,
which are determined by the prefix word of the header value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should say something here about how different DPLBs may support different subsets of these formats, so endpoints may need to be aware of which DPLBs they are sending to when choosing a format.


### Binary Protobuf

For Protocol Buffers aware code, this will be a binary serialized base64 encoded OrcaLoadReport
protobuf in `endpoint-load-metrics` header with BIN prefix:

`endpoint-load-metrics: BIN CZqZmZmZmbk/MQAAAAAAAABAQg4KA2ZvbxGamZmZmZm5P0IOCgNiYXIRmpmZmZmZyT8=`

It could also be specified in the `endpoint-load-metrics-bin` header without BIN prefix:

`endpoint-load-metrics-bin: CZqZmZmZmbk/MQAAAAAAAABAQg4KA2ZvbxGamZmZmZm5P0IOCgNiYXIRmpmZmZmZyT8=`

### Native HTTP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this header should say "Text format". It's not clear what this has to do with HTTP specifically.


Comma separated key-value pairs in `endpoint-load-metrics`. This is a flattened
representation of OrcaLoadReport, with the map fields elided into the top level scope by prepending
the ‘<map_name>.’:

`endpoint-load-metrics: TEXT cpu_utilization=0.3, mem_utilization=0.8, rps_fractional=10.0, eps=1,
named_metrics.custom_metric_util=0.4`

### JSON

The JSON encoding of OrcaLoadReport:

`endpoint-load-metrics: JSON {“cpu_utilization”: 0.3, “mem_utilization”: 0.8, “rps_fractional”: 10.0,
“eps”: 1, “named_metrics”: {“custom-metric-util”: 0.4}}`

## Out-of-Band (OOB)

OOB reporting involves an additional load reporting agent that does not sit in the request path.
This is periodically sampled with sufficient frequency to provide temporal association with
requests. The data plane LB may either initiate a connection to the agent or offer a gRPC service
for agents to report to:

```
// Out-of-band (OOB) load reporting service for the additional load reporting
// agent that does not sit in the request path. Reports are periodically sampled
// with sufficient frequency to provide temporal association with requests.
// OOB reporting compensates the limitation of in-band reporting in revealing
// costs for backends that do not provide a steady stream of telemetry such as
// long running stream operations and zero QPS services. This is a server
// streaming service, client needs to terminate current RPC and initiate
// a new call to change backend reporting frequency.
service OpenRcaService {
rpc StreamCoreMetrics(OrcaLoadReportRequest) returns (stream xds.data.orca.v3.OrcaLoadReport);
}

message OrcaLoadReportRequest {
// Interval for generating Open RCA core metric responses.
google.protobuf.Duration report_interval = 1;
// Request costs to collect. If this is empty, all known requests costs tracked by
// the load reporting agent will be returned. This provides an opportunity for
// the client to selectively obtain a subset of tracked costs.
repeated string request_cost_names = 2;
}
```

The definition for this service can be found at
https://github.com/cncf/xds/blob/main/xds/service/orca/v3/orca.proto.

Alternatively, an HTTP endpoint can be exposed by the load reporting server. A GET or HEAD request
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This paragraph seems to describe a possible future improvement, not something that's part of the existing ORCA spec. I suggest either removing it or moving it to a "Possible future improvements" section.

to this endpoint will return a response with the `endpoint-load-metrics` header that contains the
load report in one of the formats listed above.

# ORCA Integration Considerations
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this section also needs to be updated, since it no longer seems that relevant as currently written. In general, I think users are not going to choose a sidecar or library model based solely on ORCA; instead, they will probably have one of those models already and will want to know what the options are for adding ORCA support.

I think what we should be explaining here is that there are cases where a smart data plane like Envoy or gRPC can handle ORCA for you, transparently to the application, and there are other cases where the application will need to generate the ORCA data directly.


## Sidecar

A sidecar, e.g. Envoy or some other in-container proxy process, can inject the inline ORCA header
at response completion.

![Sidecar Integration Diagram](./orca-spec-sidecar.jpg)

The advantage of a sidecar model is that the application requires no modification as in the example
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is true only if the sidecar is capable of determining the metrics that the endpoint wishes to report to its clients. For example, this would probably work for things like CPU or memory utilization, which the sidecar can determine in a fairly generic way. But it won't work for application-specific things, like custom utilization metrics or query cost metrics -- those will always require cooperation on the part of the application.

service A above. A disadvantage of this approach is that it requires changes to how DPLB operators
operationally manage their backends, introducing a binary that will require upgrades and version
management. This might not be a concern in a service mesh architecture where a sidecar can be taken
as given.

Service B above links with an observability library such as OpenTelemetry that provides support for
custom metrics. This enables the sidecar to collect additional telemetry data for the ORCA reports.

## Library

The application will link in a library for ORCA that will assist with generating the inline response
header.

![Library Integration Diagram](./orca-spec-library.jpg)

This will require DPLB operators to modify their applications to call the ORCA library and inject
the results into the HTTP response.

The library will provide an interface such as:

`std::string GetLoadReportHeaderValue(const std::unordered_map<std::string, double>& named_metrics);`

This method will combine the core metrics such as CPU and Memory (computed by the library) and,
together with the extra application-specific metrics, compute the inline ORCA response header. The
application will then be responsible for including this in its response.

# Known DPLB Implementations

## gRPC
TODO: [Repository]()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


[Design](https://github.com/grpc/proposal/blob/master/A51-custom-backend-metrics.md)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


## Envoy

[Repository](https://github.com/envoyproxy/envoy/tree/main/source/common/orca)

[Design](http://docs/document/d/1gb_2pcNnEzTgo1EJ6w1Ol7O-EH-O_Ysu5o215N9MTAg?tab=t.0#heading=h.do9yfa1wlpk8)

# Change History

|Date|Description|
|----|-----------|
|Oct 25, 2024|Updated specification in this document.|
|April 16, 2019|Original [specification](https://docs.google.com/document/d/1NSnK3346BkBo1JUU3I9I5NYYnaJZQPt8_Z_XCBCI3uA/edit?tab=t.0) proposed to the Envoy community.|