RFC for QoS: Quality of Service #58
# QoS: Quality of Service

## Motivation

Queries compete for resources and thus interfere with each other. Currently, users can only deal with this in very time-consuming ways: either by increasing cluster capacity or by altering their applications, the latter of which may take hours or days.
Users want to ensure a quality of service for their queries. Some queries should be prioritized above others. For queries of the same priority, resources should be divided fairly among them. When there are multiple tenants, resource isolation should be provided while still allowing for high utilization.
## Summary

This solution provides QoS at the level of the TiKV node. QoS is configured both globally in PD and dynamically by clients.
* A QoS policy is set in PD for region groups such as key spaces (tenants) and tables.
* Larger region groups have more capacity allocated.
* An application (TiDB) can create its own policies by sending a QoS-Request that further prioritizes its own capacity.
* Analytics queries can request a low QoS.
* Local back pressure is applied on a TiKV node by rejecting queries that use too much capacity.

![QoS Architecture](../media/qos-architecture.png)

![QoS Capacity Slicing](../media/QoS-capacity-slice.png)
## Terminology

* QoS: a relative priority setting. This is not a quota: usage is always allowed to “burst” to achieve high utilization.
* Capacity: the total resources available to be prioritized.
* Key Space: in a multi-tenant setup, every tenant gets a distinct key space. More generally, a key space is designed for applications with different data ownership.
## Detailed design

### Architectural and implementation advantages

The Ti components are loosely coupled:
* PD stores policies and communicates them to TiKV
* TiKV performs query admission, providing localized back pressure
* TiDB can create its own QoS policies for its users/tables just by sending a header
Iterative. We can try to produce a useful first version without:
* Bursting
* Global fairness with adjusted weighting and a PD placement policy
* Back pressure fairness with detailed resource usage measurements

This is designed to be a minimal step towards supporting QoS-sensitive workloads such as multi-tenancy. Future work will be needed to create an improved scheduler and to improve global fairness.
### TiKV Back Pressure

#### Local Back Pressure at TiKV

TiKV will have an admission controller component. This component will track the QoS status and reject queries before they are accepted.
The downside of this approach is that TiKV does not understand a multi-node query. One node blocking a query can slow down a larger transaction and end up slowing down the system as a whole. Trying to give TiKV global information won’t scale well for a large cluster.
#### Query inhibition

Queries should be inhibited based on:
* The total capacity available on the node
* The QoS policies that apply to the query
* The estimated resources needed for the query
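The RFC leaves the exact admission rule open. As a minimal sketch of how these three inputs could be combined (all type names, constants, and units below are hypothetical, not existing TiKV code):

```rust
/// Hypothetical types for illustration only; this is not an existing TiKV API.
struct QosPolicy {
    qos: u64,       // relative priority of the region group the query belongs to
    total_qos: u64, // sum of the QoS shares of all groups active on this node
}

enum Admission {
    Admit,
    Reject,
}

/// Combine the three inputs listed above: capacity available on the node,
/// the QoS policy that applies to the query, and the query's estimated cost.
/// Thresholds are made up for the example.
fn admit(
    free_capacity: u64,
    used_by_group: u64,
    policy: &QosPolicy,
    estimated_cost: u64,
) -> Admission {
    // While there is plenty of headroom, admit everything (bursting).
    const PRESSURE_THRESHOLD: u64 = 200_000;
    if free_capacity > PRESSURE_THRESHOLD {
        return Admission::Admit;
    }
    // Under pressure, only admit the query if its group is still within the
    // capacity slice implied by the group's share of the node's total QoS.
    const NODE_CAPACITY: u64 = 1_000_000;
    let slice = NODE_CAPACITY * policy.qos / policy.total_qos.max(1);
    if used_by_group + estimated_cost <= slice {
        Admission::Admit
    } else {
        Admission::Reject
    }
}
```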
#### Resource Estimation

The amount of inhibition required depends on the number of requests and the amount of resources being requested. Effectively, when resources are highly utilized, we build up a queue of pending requests with a limited size, where the overflow is rejected.
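A minimal sketch of such a bounded pending queue (names are illustrative; this is not the TiKV scheduler’s actual queue):

```rust
use std::collections::VecDeque;

/// Sketch of "a queue of pending requests with a limited size where the
/// overflow is rejected".
struct PendingQueue<T> {
    queue: VecDeque<T>,
    max_len: usize,
}

impl<T> PendingQueue<T> {
    fn new(max_len: usize) -> Self {
        Self { queue: VecDeque::new(), max_len }
    }

    /// Enqueue a request when the node is saturated; reject the overflow so
    /// that back pressure is visible to the client immediately.
    fn push(&mut self, request: T) -> Result<(), T> {
        if self.queue.len() >= self.max_len {
            Err(request) // overflow: the caller translates this into a rejection
        } else {
            self.queue.push_back(request);
            Ok(())
        }
    }

    /// Pop the next request once capacity frees up.
    fn pop(&mut self) -> Option<T> {
        self.queue.pop_front()
    }
}
```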
Policy application is allowed to take into account the resources that will be used:

* less intensive queries can be prioritized above more intensive queries, particularly for bursting
* queries can be prioritized that together make for better resource utilization, given the multiple dimensions of resource usage

> **Review comment:** Here you talk about prioritisation of queries, but in the above section it sounds like TiKV just has a run/reject binary for queries.
>
> **Author reply:** Yes, this was not written clearly. To decide what queries to admit we need to apply QoS policies. But admission can also take into account the resources being used. Once admitted, I think we will just do policy (region-based) prioritization. This should be fleshed out more.
#### Resource measurement

TiKV must measure the resource usage of the node as a whole.
However, in our first version we do not take into account the actual usage of different policies. To improve our ability to estimate resource usage, we will need to develop the ability to measure the actual resources used by the policies being applied. These measurements can eventually be used to apply QoS more intelligently. For example, the effects of bad estimates can be corrected.
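As a sketch of what such per-policy measurement could look like once it exists (purely illustrative; none of these types are in TiKV today), tracking actual usage alongside estimates yields a correction factor per policy:

```rust
use std::collections::HashMap;

/// Illustrative per-policy accounting: track actual usage alongside the
/// estimates so that estimation error can be corrected over time.
#[derive(Default)]
struct PolicyUsage {
    estimated: u64, // sum of estimated costs of admitted queries
    actual: u64,    // sum of measured costs once the queries finish
}

#[derive(Default)]
struct UsageTracker {
    per_policy: HashMap<String, PolicyUsage>,
}

impl UsageTracker {
    fn record(&mut self, policy: &str, estimated: u64, actual: u64) {
        let entry = self.per_policy.entry(policy.to_string()).or_default();
        entry.estimated += estimated;
        entry.actual += actual;
    }

    /// Correction factor applied to future estimates for this policy;
    /// a value above 1.0 means the policy's queries cost more than estimated.
    fn correction(&self, policy: &str) -> f64 {
        match self.per_policy.get(policy) {
            Some(u) if u.estimated > 0 => u.actual as f64 / u.estimated as f64,
            _ => 1.0,
        }
    }
}
```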
### QoS Policy

#### QoS Value and composition

QoS is specified as an integer value on a linear scale. A greater value reflects a greater priority, and a value twice as large is twice the priority. Negative values are effectively treated as a fraction between 0 and 1.
QoS values can compose in two different ways (these are also discussed in later sections):
* Inner Override (replace): a table QoS value overrides a keyspace QoS setting.
* Inner Prioritization (greater specificity): a custom application request QoS value is a priority relative to other application requests. The application as a whole is still governed by the keyspace QoS value.
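A small sketch of the two composition modes (names are hypothetical; the default of 1 for a missing QoS-Request is an assumption):

```rust
/// Illustrative composition of QoS values: a table-level value replaces the
/// key-space value, while an application's QoS-Request only ranks requests
/// inside the key space's own share.
struct EffectiveQos {
    /// Compared against other region groups on the node.
    group_qos: i64,
    /// Compared only against other requests of the same application.
    request_qos: i64,
}

fn compose(keyspace_qos: i64, table_qos: Option<i64>, qos_request: Option<i64>) -> EffectiveQos {
    const DEFAULT_REQUEST_QOS: i64 = 1; // assumed default when no QoS-Request is sent
    EffectiveQos {
        // Inner override: a table setting replaces the key-space setting.
        group_qos: table_qos.unwrap_or(keyspace_qos),
        // Inner prioritization: a QoS-Request only divides the group's share.
        request_qos: qos_request.unwrap_or(DEFAULT_REQUEST_QOS),
    }
}
```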
#### QoS Policy stored in PD

A QoS policy is set by an administrator in PD. It is a combination of a region group and a QoS value. The main region group is a key space. Smaller groups within a key space, such as a table, may be specified, and their QoS setting takes precedence over that of the key space. These groups are dynamic (new regions can be added) and are translated to regions by PD, which has knowledge of tenant and table groupings.

> **Review comment:** Presumably when a region splits, it inherits the QoS parameters from its parents. What happens when two regions with different QoS are merged? Does PD have knowledge of how tables/tenants are represented within a key space? My assumption is that only TiDB knows this.
>
> **Author reply:** I am a little fuzzy on this detail, but I know we can now prevent tables from sharing regions, so it should be possible for PD to know this.
>
> **Review comment:** Perhaps QoS policies should bind to a range instead of some regions, just like placement rules.
>
> **Author reply:** This proposal doesn't closely specify how regions will be grouped. Grouping by key range could be a great solution. The problem with this approach is that a region could span multiple key ranges. We could try to use key range as the underlying primitive but also try to have APIs that talk in terms of key spaces. We can also reject key ranges that don't fully enclose their existing regions. It is noted in placement rules that the key range of a table can change due to DDL commands. So I am thinking that for the first version of QoS, PD can understand key spaces but won't understand tables and may need to accept key ranges.

A default QoS request setting may be provided for applications that send a QoS value per request; see the QoS Request section.

These QoS policies must be periodically (perhaps once a minute) communicated to TiKV.

A first implementation can assume that all regions have the same QoS.
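A sketch of what such a stored policy record might contain, per the description above (field names are illustrative and not a PD API):

```rust
/// Sketch of a QoS policy record in PD, following the description above.
struct QosPolicyRecord {
    /// The region group the policy applies to: a key space (tenant) or,
    /// with higher precedence, a table inside a key space.
    group: RegionGroup,
    /// Relative priority of the group on the linear QoS scale.
    qos: i64,
    /// Optional default for applications that send a QoS-Request per request.
    default_request_qos: Option<i64>,
}

enum RegionGroup {
    KeySpace { name: String },
    Table { key_space: String, table: String },
}
```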
#### QoS Request: Custom Application Policies

In TiDB we would like to attach a QoS to a user, a role, or some other TiDB-specific object. These application-specific policies should remain in TiDB rather than being pushed down to PD.

The application will already have a QoS relative to other applications based on the number of regions in its key space and the QoS setting. Application-specific policies allow queries within the same QoS to be prioritized differently.

Custom application policies are sent to TiKV by setting a “QoS-Request” field. The QoS request is relative to other requests using the same QoS Key, and the value is not compared to the region-based QoS that it is dividing.

A default value for QoS-Request may be set in PD as part of the QoS policy for a region group. Otherwise a default value is assumed.

In a delegated authentication setup, the QoS-Request field should be received in a signed auth token. This proves the QoS was negotiated with the application owner.
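The shape of this dynamic per-request information, sketched as a plain struct (this is not the actual kvproto/gRPC definition, only an illustration of the idea):

```rust
/// Illustration of the per-request QoS information described above.
struct RequestQosContext {
    /// Identifies which application's share the request is ranked within
    /// (e.g. the key space), so that QoS-Request values from different
    /// applications are never compared with each other.
    qos_key: String,
    /// The application-chosen priority relative to its own other requests.
    qos_request: i64,
    /// In a delegated-auth setup this value would arrive inside a signed
    /// token proving it was negotiated with the application owner.
    signed_token: Option<Vec<u8>>,
}
```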
#### Policy Application

When a node approaches full utilization, it inhibits queries by prioritizing them based on QoS policy. The QoS policy is a combination of recorded settings in PD and dynamic QoS requests that determine the relative QoS share of different region groups.

The QoS share of a region group is specified by the QoS policy in PD for that group, or by a low-priority request QoS setting. That QoS is multiplied by the total number of regions in the group on the node, so that larger groups are assigned more capacity.

Queries of groups that are being inhibited are then prioritized according to their QoS request value (if these values are sent).
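A rough sketch of the share computation described above (names and units are illustrative):

```rust
use std::collections::HashMap;

/// The capacity slice of a region group on one node is proportional to
/// (group QoS x number of the group's regions on that node).
fn capacity_slices(
    node_capacity: u64,
    group_qos: &HashMap<String, u64>,       // QoS value per region group
    regions_on_node: &HashMap<String, u64>, // region count per group on this node
) -> HashMap<String, u64> {
    // Weight of each group on this node: QoS value x number of its regions here.
    let weights: HashMap<&String, u64> = group_qos
        .iter()
        .map(|(group, qos)| {
            (group, qos * regions_on_node.get(group).copied().unwrap_or(0))
        })
        .collect();
    let total: u64 = weights.values().sum::<u64>().max(1);
    // Each group's slice of the node's capacity is proportional to its weight.
    weights
        .into_iter()
        .map(|(group, w)| (group.clone(), node_capacity * w / total))
        .collect()
}
```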
## Global Fairness

### Dynamic QoS adjustment by PD

Instead of altering physical data placement for fairness, PD can dynamically adjust the QoS value of regions.
PD can tell a TiKV node that the QoS for a hot region is larger, to make up for cold regions not utilizing capacity on another node.
When this can be used, we can think of it as achieving the same effect as placement without having to move the data.
This approach is explained and measured in [this paper](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-215.pdf) as adjusting the local per-tenant weight (QoS).
Data placement happens on a long-term time scale, whereas dynamic adjustment can happen every few seconds.
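One way to read this (a rough sketch only; the cited paper's exact formulation differs and nothing here is an existing PD API) is that PD redistributes a tenant's weight across nodes in proportion to where its demand actually is, keeping the tenant's total weight unchanged:

```rust
/// Scale a tenant's base QoS on one node by how concentrated its demand is
/// there, so hot regions get extra local weight while the tenant's summed
/// weight across all nodes stays the same. Purely illustrative.
fn adjusted_local_qos(
    base_qos: f64,
    demand_on_node: f64, // e.g. recent request cost of the tenant on this node
    demand_total: f64,   // the tenant's recent request cost across all nodes
    node_count: f64,
) -> f64 {
    if demand_total <= 0.0 || node_count <= 0.0 {
        return base_qos;
    }
    // Ratio of this node's share of the tenant's demand to a uniform share.
    let concentration = demand_on_node / (demand_total / node_count);
    base_qos * concentration
}
```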
### Adjusted Follower Read

The proposal assumes a single leader.
When using follower read, we should take it into account in our QoS weighting, similar to the replica selection outlined in [this paper](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-215.pdf).
### PD Placement for multi-tenancy

This QoS solution is expected to perform poorly in the following scenario:
* multi-tenant, where a tenant has few regions on a node
* a user has hot regions on one node and cold regions on another

Here the user will not get to share capacity between their hot and cold regions.
We can solve this with group-based node placement.
A group consists of the regions of a single key space (tenant), balanced between hot and cold regions.
If multiple hotspot regions are in the same group, we should balance these regions to other groups.

Leaders of regions in the same group are placed on the same node. For a small user with just one region group, this placement reduces the likelihood of a small availability incident occurring but greatly increases the probability of a large availability incident, which is undesirable. For a large user with many groups, the overall availability may not change.
See this [PD GitHub issue](https://github.com/tikv/pd/issues/2950).
## Drawbacks

* No global perspective
  * Because different queries operate on different nodes, a query with a lower QoS request may effectively be prioritized above a query with a higher QoS.
  * Tenants will experience degraded QoS due to tenant conflict in some cases, but this can be mitigated by rebalancing.
* No integration with a resource scheduler
* No ability to stop queries once started
## Alternatives

Static quota enforcement. Users may prefer to communicate about QoS in terms of quota guarantees. However, static quotas can be inferred from QoS: a quota is the division of capacity according to the QoS settings.
Bursting is important for high utilization. With QoS it is clear what should happen with bursting; with quotas there must be some assumptions about priority.
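As an illustration with made-up numbers: if two key spaces on a node have QoS-weighted shares of 2 and 1, the implied static quotas are 2/3 and 1/3 of the node's capacity. Under QoS, either key space may burst past its implied quota whenever the other is idle; under static quotas, that spare capacity would sit unused unless additional priority assumptions were layered on.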
## Unresolved questions

The exact way to communicate that queries are rejected has not been specified yet. Clients should be able to recognize that they are being rejected due to overloading of the server.

The work required in the scheduler to allow for fair usage of resources and to stop queries that are using too many resources is unknown. Overall work will be limited for this proposal, and improving resource scheduling will continue as an independent long-term project.
> **Review comment:** Presumably TiFlash would work in the same way as TiKV.
>
> **Author reply:** I haven't thought about TiFlash. Managing QoS for the OLTP workload path is critical; for OLAP it is less important. TiFlash is also gaining some direct write support, but I have no idea how that works. As the MPP support for TiFlash improves, it will be easier to handle TiFlash load by scaling out. Additionally, applications that benefit from TiFlash would generally be big enough to have their own TiDB cluster. This proposal will most benefit the smaller applications that must use a shared TiDB cluster until they grow larger.