RFC for QoS: Quality of Service #58

Open · wants to merge 3 commits into base: master
Binary file added media/QoS-capacity-slice.png
Binary file added media/qos-architecture.png
172 changes: 172 additions & 0 deletions text/2020-09-24-QoS.md
# QoS: Quality of Service

## Motivation

Queries compete for resources and thus interfere with each other. Currently users can only deal with this in very time consuming ways by either increasing cluster capacity or altering their applications, the latter of which may take hours or days.

Users want to ensure a quality of service for their queries. Some queries should be prioritized above others. For queries of the same priority, resources should be divided fairly among queries. When there are multiple tenants, provide resource isolation but still allow for high utilization.

## Summary

This solution provides QoS at the level of the TiKV node. QoS is configured both globally in PD and dynamically by clients.

* QoS Policy is set in PD for region groups such as key spaces (tenant) and tables
* Larger region groups have more capacity allocated
* Allow an application (TiDB) to create its own policies by sending a QoS-Request that further prioritizes its own capacity.
* Analytics queries can request a low QoS
* Apply local back pressure on a TiKV node by rejecting queries using too much capacity

![QoS Architecture](../media/qos-architecture.png)

![QoS Capacity Slicing](../media/QoS-capacity-slice.png)


## Terminology

* QoS: a relative priority setting. This is not a quota: usage is always “bursting” to achieve high utilization.
* Capacity: the total resources available to be prioritized
* Key Space: in a multi-tenant setup, every tenant gets a distinct key space. More generally a key space is designed for applications with different data ownership.


## Detailed design

### Architectural and Implementation advantages

Ti Components are loosely coupled:
* PD stores policies and communicates them to TiKV
* TiKV performs query admission, providing localized back pressure
> **Review comment:** Presumably TiFlash would work in the same way as TiKV.
>
> **@gregwebs (Contributor, author), Oct 12, 2020:** I haven't thought about TiFlash. Managing QoS for the OLTP workload path is critical; for OLAP it is less important. TiFlash is also gaining some direct write support, but I have no idea how that works. As the MPP support for TiFlash improves, it will be easier to handle TiFlash load by scaling out. Additionally, applications that benefit from TiFlash would generally be big enough to have their own TiDB cluster. This proposal will most benefit the smaller applications that must use a shared TiDB cluster until they grow larger.

* TiDB can create its own QoS policies for its users/tables just by sending a header

The design is iterative: we can try to produce a useful first version without:
* Bursting
* Global Fairness with adjusted weighting and a PD placement policy
* Back Pressure fairness with detailed resource usage measurements


This is designed to be a minimal step towards supporting QoS sensitive workloads such as multi-tenant. Future work will be needed to create an improved scheduler and to improve global fairness.

### TiKV Back Pressure

#### Local Back Pressure at TiKV

TiKV will have an admission controller component. This component will track the QoS status and reject queries before they are accepted.

The downside to following this approach is that TiKV does not understand a multi-node query. One node blocking a query can slow down a larger transaction and end up slowing down the system as a whole. Trying to give TiKV global information won’t scale up well for a large cluster.
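As a rough illustration of where this component sits, the Rust sketch below shows one possible shape for such an admission controller; every type, field, and threshold name here is an assumption made for the example, not a committed TiKV API.

```rust
use std::collections::HashMap;

/// Hypothetical per-group policy as pushed down from PD (field names are
/// illustrative, not part of this proposal).
struct QosPolicy {
    /// QoS value for the region group; larger means higher priority.
    qos: i64,
    /// Number of the group's regions currently hosted on this node
    /// (used by the share computation sketched in later sections).
    regions_on_node: u64,
}

/// Sketch of a per-node admission controller.
struct AdmissionController {
    /// Policies keyed by region group (e.g. key space id), refreshed from PD.
    policies: HashMap<u64, QosPolicy>,
    /// Fraction of node capacity currently in use, in 0.0..=1.0.
    utilization: f64,
    /// Utilization above which lower-priority requests start to be rejected.
    pressure_threshold: f64,
}

impl AdmissionController {
    /// Decide whether to accept a request for a given region group. Rejected
    /// requests are never started, which is the local back-pressure signal.
    fn admit(&self, group_id: u64) -> bool {
        if self.utilization < self.pressure_threshold {
            return true; // capacity to spare: always admit (bursting)
        }
        // Under pressure: only admit groups with a known, positive QoS.
        // How a group's share is computed is sketched in later sections.
        self.policies.get(&group_id).map_or(false, |p| p.qos > 0)
    }
}
```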

#### Query inhibition

Queries should be inhibited based on the following (a minimal sketch follows this list):
* The total capacity available on the node
* The QoS policies that apply to the query
* The estimated resources needed for the query
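A minimal sketch of how these three inputs could combine into an admit/reject decision, assuming an illustrative two-dimensional cost model (CPU time and read bytes); the function and field names are not part of the proposal.

```rust
/// Illustrative cost estimate for a single request; the real dimensions and
/// units are not fixed by this RFC.
struct EstimatedCost {
    cpu_ms: f64,
    read_bytes: f64,
}

/// Sketch: admit while the group's QoS-weighted slice of the remaining node
/// capacity can absorb the estimated cost of the request.
fn should_admit(
    free_cpu_ms: f64,     // CPU time still available in the current window
    free_read_bytes: f64, // I/O budget still available in the current window
    group_share: f64,     // this group's share of capacity, derived from its QoS policy
    cost: &EstimatedCost,
) -> bool {
    cost.cpu_ms <= free_cpu_ms * group_share
        && cost.read_bytes <= free_read_bytes * group_share
}
```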

#### Resource Estimation

The amount of inhibition required depends on the number of requests and the amount of resources being requested. Effectively, when resources are highly utilized we build up a bounded queue of pending requests and reject the overflow.
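A sketch of that bounded pending queue, assuming a simple fixed-length limit; the real implementation would live in TiKV's scheduling layers and may differ.

```rust
use std::collections::VecDeque;

/// Sketch of the bounded queue of pending requests described above. Under
/// pressure, new requests wait here; once the queue is full, further requests
/// are rejected outright, which is the back-pressure signal to clients.
struct PendingQueue<R> {
    queue: VecDeque<R>,
    max_len: usize,
}

impl<R> PendingQueue<R> {
    /// Enqueue a request, or hand it back if the queue has overflowed.
    fn push(&mut self, req: R) -> Result<(), R> {
        if self.queue.len() >= self.max_len {
            Err(req) // overflow: reject so the client can back off or retry elsewhere
        } else {
            self.queue.push_back(req);
            Ok(())
        }
    }
}
```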

Policy application is allowed to take into account the resources that will be used:
> **Review comment:** Here you talk about prioritisation of queries, but in the section above it sounds like TiKV just has a run/reject binary for queries.
>
> **@gregwebs (Contributor, author):** Yes, this was not written clearly. To decide which queries to admit we need to apply QoS policies, but admission can also take into account the resources being used. Once admitted, I think we will just do policy (region-based) prioritization. This should be fleshed out more.


* Less intensive queries can be prioritized above more intensive ones, particularly for bursting
* Queries that together make for better resource utilization across the multiple dimensions of resource usage can be prioritized

#### Resource measurement

TiKV must measure the resource usage of the node as a whole.
However, in the first version we do not take into account the actual usage of different policies. To improve our resource estimates we will need to measure the actual resources used by the policies being applied. These measurements can eventually be used to apply QoS more intelligently; for example, the effects of bad estimates can be corrected.
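As a hedged sketch of what such a correction could look like once per-policy measurement exists, the example below keeps an exponentially smoothed ratio of measured to estimated cost; the structure and smoothing approach are assumptions, not part of the first version.

```rust
/// Sketch: keep a smoothed ratio of measured cost to estimated cost for a
/// policy and scale future estimates by it. Purely illustrative; the RFC
/// leaves per-policy measurement to a later version.
struct EstimateCorrector {
    /// Smoothed (measured / estimated) ratio; starts at 1.0.
    correction: f64,
    /// Smoothing factor for the exponential moving average, e.g. 0.2.
    alpha: f64,
}

impl EstimateCorrector {
    /// Fold one finished request's actual cost into the correction factor.
    fn record(&mut self, estimated: f64, measured: f64) {
        if estimated > 0.0 {
            let ratio = measured / estimated;
            self.correction = self.alpha * ratio + (1.0 - self.alpha) * self.correction;
        }
    }

    /// Apply the learned correction to a fresh estimate.
    fn corrected(&self, estimate: f64) -> f64 {
        estimate * self.correction
    }
}
```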

### QoS Policy

#### QoS Value and composition

QoS is specified as an integer value on a linear scale. A greater value reflects a greater priority, and a value twice as large represents twice the priority. Negative values are effectively treated as a fraction between 0 and 1.

QoS values can compose in two different ways (these are also discussed in later sections, and a sketch follows this list):
* Inner Override (replace): a table QoS value overrides a key space QoS setting
* Inner Prioritization (greater specificity): a custom application-request QoS value is a priority relative to other application requests. The application as a whole is still governed by the key space QoS value
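The exact numeric interpretation is not pinned down here, so the sketch below shows only one plausible reading: negative values map to reciprocals, overrides replace, and prioritization keeps the outer key space weight while ranking requests inside it. All function names are illustrative.

```rust
/// Map a QoS integer onto a linear weight. Positive values are used as-is;
/// here a negative value -n is read as the fraction 1/n, which is one possible
/// reading of "treated as a fraction between 0 and 1".
fn qos_weight(qos: i64) -> f64 {
    if qos > 0 {
        qos as f64
    } else if qos < 0 {
        1.0 / (-qos) as f64
    } else {
        1.0 // assume 0 means the default weight
    }
}

/// Inner Override: a more specific setting (e.g. a table) replaces the
/// key space setting when present.
fn effective_qos(keyspace_qos: i64, table_qos: Option<i64>) -> i64 {
    table_qos.unwrap_or(keyspace_qos)
}

/// Inner Prioritization: a per-request QoS only ranks requests against each
/// other within the same key space; the key space weight still governs the
/// application as a whole. The pair below can serve as a sort key.
fn request_rank(keyspace_qos: i64, request_qos: i64) -> (f64, i64) {
    (qos_weight(keyspace_qos), request_qos)
}
```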

#### QoS Policy stored in PD

A QoS policy is set by an administrator in PD. It is a combination of a region group and a QoS value. The main region group is a key space. Smaller groups within a key space, such as a table, may also be specified, and that QoS setting takes precedence over the key space setting. These groups are dynamic (new regions can be added) and are translated to regions by PD, which has knowledge of tenant and table groupings.
> **Review comment:** Presumably when a region splits, it inherits the QoS parameters from its parents. What happens when two regions with different QoS are merged?
>
> Does PD have knowledge of how tables/tenants are represented within a key space? My assumption is that only TiDB knows this.
>
> **@gregwebs (Contributor, author):** I am a little fuzzy on this detail, but I know we can now prevent tables from sharing regions, so it should be possible for PD to know this.
>
> **@BusyJay (Member), Oct 12, 2020:** Perhaps QoS policies should bind to a range instead of some regions, just like placement rules.
>
> **@gregwebs (Contributor, author):** This proposal doesn't closely specify how regions will be grouped. Grouping by key range could be a great solution. The problem with this approach is that a region could span multiple key ranges. We could try to use key range as the underlying primitive but also have APIs that talk in terms of key spaces. We can also reject key ranges that don't fully enclose their existing regions. The placement rules design notes that the key range of a table can change due to DDL commands. So I am thinking that, for the first version of QoS, PD can understand key spaces but won't understand tables, and may need to accept key ranges.


A default QoS request setting may be provided for applications that send a QoS value per request; see the QoS Request section.

These QoS policies must be periodically (perhaps once a minute) communicated to TiKV.

A first implementation can assume that all regions have the same QoS.
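To make the shape of such a policy concrete, here is an illustrative record as PD might store and push it. PD itself is written in Go; Rust is used only for consistency with the other sketches here, and every name is an assumption.

```rust
/// Sketch of the policy record PD might store and periodically push to TiKV.
struct QosPolicyRecord {
    /// The region group the policy applies to.
    group: RegionGroup,
    /// QoS value for the whole group.
    qos: i64,
    /// Optional default for per-request QoS values (see the QoS Request section).
    default_request_qos: Option<i64>,
}

/// How a group might be addressed; as discussed in the review thread above,
/// the first version may only understand key spaces and raw key ranges.
enum RegionGroup {
    KeySpace(u64),
    Table(u64),
    KeyRange { start: Vec<u8>, end: Vec<u8> },
}
```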

#### QoS Request: Custom Application Policies

In TiDB we would like to attach a QoS to a user, a role, or some other TiDB specific object. These application-specific policies should remain in TiDB rather than being pushed down to PD.

The application will already have a QoS relative to other applications based on the number of regions in its key space and the QoS setting. Application-specific policies allow queries within the same QoS to be prioritized differently.

Custom application policies are sent to TiKV by setting a “QoS-Request” field. The QoS request is relative to other requests using the same QoS Key; its value is not compared to the region-based QoS that it subdivides.

A default value for QoS-Request may be set in PD as part of the QoS policy for a region group. Otherwise a default value is assumed.

In a delegated authentication setup, the QoS-Request field should be received in a signed auth token. This proves the QoS was negotiated with the application owner.
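An illustrative shape for this per-request information; the actual transport (a request header field, gRPC metadata, or a claim in a signed auth token, as mentioned above) is left open, and these type and field names are assumptions.

```rust
/// Sketch of the per-request QoS information an application such as TiDB
/// could attach to a request.
struct QosRequest {
    /// Requests sharing the same key are prioritized only relative to each other.
    qos_key: String,
    /// Relative priority among requests with the same key; not comparable to
    /// the region-group QoS that it subdivides.
    qos_value: i64,
}

/// Illustrative request context as TiKV might see it.
struct RequestContext {
    region_id: u64,
    /// Absent means: fall back to the default configured in the PD policy.
    qos_request: Option<QosRequest>,
}
```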

#### Policy Application

When a node approaches full capacity it inhibits queries by prioritizing them based on QoS Policy. QoS Policy is the combination of recorded settings in PD and dynamic QoS requests that determines the relative QoS share of different region groups.

The QoS share of a region group is specified by the QoS Policy in PD for that group, or by a low-priority default request QoS setting. That QoS is multiplied by the total number of the group's regions on the node so that larger groups are assigned more capacity.

Queries of groups that are being inhibited are then prioritized according to their QoS request value (if these values are sent).
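A small sketch of the share computation described above, with made-up numbers; the normalization and the example figures are illustrative only.

```rust
/// Capacity share of one region group on a node: its policy QoS multiplied by
/// the number of its regions hosted here, normalized against all groups.
fn group_share(group_qos: i64, regions_on_node: u64, total_weight: f64) -> f64 {
    (group_qos as f64 * regions_on_node as f64) / total_weight
}

/// Example with made-up numbers: two key spaces on the same node.
///   A: QoS 2, 300 regions -> weight 600
///   B: QoS 1, 100 regions -> weight 100
/// Total weight 700, so A gets ~86% and B ~14% of contended capacity.
/// Within an inhibited group, requests are then ordered by their
/// QoS-Request values when those are present.
fn example_shares() -> (f64, f64) {
    let total = (2 * 300 + 1 * 100) as f64;
    (group_share(2, 300, total), group_share(1, 100, total))
}
```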

## Global Fairness

### Dynamic QoS adjustment by PD

Instead of altering physical data placement for fairness, PD can dynamically adjust the QoS value of regions.
PD can tell a TiKV node that the QoS for a hot region is larger to make up for cold regions not utilizing capacity on another node.
When this can be used we can think of this as achieving the same effect as placement without having to move the data.
This approach is explained and measured in [this paper](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-215.pdf) as adjusting local per-tenant weight (QoS).
Data placement happens on a long term time scale whereas dynamic adjustment can happen every few seconds.
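A hedged sketch of what such a dynamic adjustment might carry; the message shape, cadence, and RPC used are all assumptions rather than part of this proposal.

```rust
/// Sketch: a per-group multiplier PD could periodically send to a node so that
/// a tenant's hot regions here can use capacity its cold regions leave idle
/// elsewhere. A multiplier of 1.0 means "just use the static policy".
struct QosAdjustment {
    group_id: u64,
    multiplier: f64,
}

/// Effective QoS weight after applying any dynamic adjustment.
fn adjusted_qos(static_qos: i64, adjustment: Option<&QosAdjustment>) -> f64 {
    static_qos as f64 * adjustment.map_or(1.0, |a| a.multiplier)
}
```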

### Adjusted Follower Read

The proposal assumes a single leader.
When using follower read, we should take it into account in our QoS weighting, similar to what is outlined in [this paper](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-215.pdf) as replica selection.

### PD Placement for multi-tenancy

This QoS solution is expected to perform poorly in the following scenario:
* multi-tenant where a tenant has few regions on a node
* a user has hot regions on one node and cold regions on another

Here the user will not get to share capacity between their hot and cold regions.
We can solve this with group-based node placement.

A group is a set of regions from a single key space (tenant) that balances hot and cold regions.
If multiple hotspot regions are in the same group, we should balance these regions to other groups.

Leaders of regions in the same group are placed on the same node. For a small user with just one region group, this placement reduces the likelihood of a small availability incident occurring but greatly increases the probability of a large availability incident, which is undesirable. For a large user with many groups the overall availability may not change.

See this [PD Github Issue](https://github.com/tikv/pd/issues/2950).



## Drawbacks

* No Global perspective
* Because different queries operate on different nodes, a query with a lower QoS request may effectively be prioritized above a query with a higher QoS.
* Tenants will experience degraded QoS due to tenant conflict in some cases, but this can be mitigated by rebalancing
* No integration with a resource scheduler
* No ability to stop queries once started



## Alternatives

Static quota enforcement. Users may prefer to communicate about QoS in terms of quota guarantees. However, static quotas can be inferred from QoS: a quota is the division of capacity according to the QoS settings.
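For example, the implied quota can be derived from the QoS weights directly (illustrative sketch; the arithmetic is the point, not the function name):

```rust
/// A static quota can be read off from the QoS weights: each group's implied
/// quota is its QoS-weighted fraction of total capacity. For example, weights
/// 2 and 1 over 9 "capacity units" imply quotas of 6 and 3. With QoS the
/// unused portion can still be borrowed (bursting); a hard quota would leave
/// it idle.
fn implied_quota(capacity: f64, group_weight: f64, total_weight: f64) -> f64 {
    capacity * group_weight / total_weight
}
```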

Bursting is important for high utilization. With QoS it is clear what should happen when bursting; with quotas there must be some assumptions about priority.


## Unresolved questions

The exact way to communicate that queries are rejected has not been specified yet. Clients should be able to recognize that they are getting rejected due to overloading of the server.

The work required in the scheduler to allow for fair usage of resources and to stop queries that are using too many resources is unknown. Overall work will be limited for this proposal and improving resource scheduling will continue as an independent long-term project.