RFC for QoS: Quality of Service #58
# QoS: Quality of Service

## Motivation

Queries compete for resources and thus interfere with each other. Currently, users can only deal with this in very time-consuming ways: either by increasing cluster capacity or by altering their applications, the latter of which may take hours or days.
Users want to ensure a quality of service for their queries. Some queries should be prioritized above others. For queries of the same priority, resources should be divided fairly among them. When there are multiple tenants, resource isolation should be provided while still allowing for high utilization.
## Summary

This solution provides QoS at the level of the TiKV node. QoS is configured both globally in PD and dynamically by clients.
* A QoS policy is set in PD for region groups such as key spaces (tenants) and tables.
* Larger region groups have more capacity allocated.
* An application (TiDB) can create its own policies by sending a QoS-Request that further prioritizes its own capacity.
* Analytics queries can request a low QoS.
* Local back pressure is applied on a TiKV node by rejecting queries that use too much capacity.

![QoS Architecture](../media/qos-architecture.png)

![QoS Capacity Slicing](../media/QoS-capacity-slice.png)
## Terminology

* QoS: a relative priority setting. This is not a quota: usage is always allowed to “burst” to achieve high utilization.
* Capacity: the total resources available to be prioritized.
* Key Space: in a multi-tenant setup, every tenant gets a distinct key space. More generally, a key space is designed for applications with different data ownership.
## Detailed design

### Architectural and implementation advantages

The Ti components are loosely coupled:
* PD stores policies and communicates them to TiKV
* TiKV performs query admission, providing localized back pressure
* TiDB can create its own QoS policies for its users/tables just by sending a header
Iterative. We can try to produce a useful first version without:
* Bursting
* Global fairness with adjusted weighting and a PD placement policy
* Back pressure fairness with detailed resource usage measurements

This is designed to be a minimal step towards supporting QoS-sensitive workloads such as multi-tenancy. Future work will be needed to create an improved scheduler and to improve global fairness.
### TiKV Back Pressure

#### Local Back Pressure at TiKV

TiKV will have an admission controller component. This component will track the QoS status and reject queries before they are accepted.
The downside of this approach is that TiKV does not understand a multi-node query. One node blocking a query can slow down a larger transaction and end up slowing down the system as a whole. Trying to give TiKV global information won’t scale well for a large cluster.
#### Query inhibition

Queries should be inhibited based on:
* The total capacity available on the node
* The QoS policies that apply to the query
* The estimated resources needed for the query
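The RFC leaves the exact admission rule open. As a minimal sketch of how these three inputs could be combined (all type names, constants, and units below are hypothetical, not existing TiKV code):

```rust
/// Hypothetical types for illustration only; this is not an existing TiKV API.
struct QosPolicy {
    qos: u64,       // relative priority of the region group the query belongs to
    total_qos: u64, // sum of the QoS shares of all groups active on this node
}

enum Admission {
    Admit,
    Reject,
}

/// Combine the three inputs listed above: capacity available on the node,
/// the QoS policy that applies to the query, and the query's estimated cost.
/// Thresholds are made up for the example.
fn admit(
    free_capacity: u64,
    used_by_group: u64,
    policy: &QosPolicy,
    estimated_cost: u64,
) -> Admission {
    // While there is plenty of headroom, admit everything (bursting).
    const PRESSURE_THRESHOLD: u64 = 200_000;
    if free_capacity > PRESSURE_THRESHOLD {
        return Admission::Admit;
    }
    // Under pressure, only admit the query if its group is still within the
    // capacity slice implied by the group's share of the node's total QoS.
    const NODE_CAPACITY: u64 = 1_000_000;
    let slice = NODE_CAPACITY * policy.qos / policy.total_qos.max(1);
    if used_by_group + estimated_cost <= slice {
        Admission::Admit
    } else {
        Admission::Reject
    }
}
```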
#### Resource Estimation

The amount of inhibition required depends on the number of requests and the amount of resources being requested. Effectively, when resources are highly utilized, we build up a queue of pending requests with a limited size, where the overflow is rejected.
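A minimal sketch of such a bounded pending queue (names are illustrative; this is not the TiKV scheduler’s actual queue):

```rust
use std::collections::VecDeque;

/// Sketch of "a queue of pending requests with a limited size where the
/// overflow is rejected".
struct PendingQueue<T> {
    queue: VecDeque<T>,
    max_len: usize,
}

impl<T> PendingQueue<T> {
    fn new(max_len: usize) -> Self {
        Self { queue: VecDeque::new(), max_len }
    }

    /// Enqueue a request when the node is saturated; reject the overflow so
    /// that back pressure is visible to the client immediately.
    fn push(&mut self, request: T) -> Result<(), T> {
        if self.queue.len() >= self.max_len {
            Err(request) // overflow: the caller translates this into a rejection
        } else {
            self.queue.push_back(request);
            Ok(())
        }
    }

    /// Pop the next request once capacity frees up.
    fn pop(&mut self) -> Option<T> {
        self.queue.pop_front()
    }
}
```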
Policy application is allowed to take into account the resources that will be used:

* less intensive queries can be prioritized above more intensive queries, particularly for bursting
* queries can be prioritized that together make for better resource utilization, given the multiple dimensions of resource usage

> **Review comment:** Here you talk about prioritisation of queries, but in the above section it sounds like TiKV just has a run/reject binary for queries.
>
> **Author reply:** Yes, this was not written clearly. To decide what queries to admit we need to apply QoS policies. But admission can also take into account the resources being used. Once admitted, I think we will just do policy (region-based) prioritization. This should be fleshed out more.
#### Resource measurement

TiKV must measure the resource usage of the node as a whole.
However, in our first version we do not take into account the actual usage of different policies. To improve our ability to estimate resource usage, we will need to develop the ability to measure the actual resources used by the policies being applied. These measurements can eventually be used to apply QoS more intelligently. For example, the effects of bad estimates can be corrected.
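As a sketch of what such per-policy measurement could look like once it exists (purely illustrative; none of these types are in TiKV today), tracking actual usage alongside estimates yields a correction factor per policy:

```rust
use std::collections::HashMap;

/// Illustrative per-policy accounting: track actual usage alongside the
/// estimates so that estimation error can be corrected over time.
#[derive(Default)]
struct PolicyUsage {
    estimated: u64, // sum of estimated costs of admitted queries
    actual: u64,    // sum of measured costs once the queries finish
}

#[derive(Default)]
struct UsageTracker {
    per_policy: HashMap<String, PolicyUsage>,
}

impl UsageTracker {
    fn record(&mut self, policy: &str, estimated: u64, actual: u64) {
        let entry = self.per_policy.entry(policy.to_string()).or_default();
        entry.estimated += estimated;
        entry.actual += actual;
    }

    /// Correction factor applied to future estimates for this policy;
    /// a value above 1.0 means the policy's queries cost more than estimated.
    fn correction(&self, policy: &str) -> f64 {
        match self.per_policy.get(policy) {
            Some(u) if u.estimated > 0 => u.actual as f64 / u.estimated as f64,
            _ => 1.0,
        }
    }
}
```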
### QoS Policy

#### QoS Value and composition

QoS is specified as an integer value on a linear scale. A greater value reflects a greater priority, and a value twice as large is twice the priority. Negative values are effectively treated as a fraction between 0 and 1.
QoS values can compose in two different ways (these are also discussed in later sections):
* Inner Override (replace): a table QoS value overrides a keyspace QoS setting.
* Inner Prioritization (greater specificity): a custom application request QoS value is a priority relative to other application requests. The application as a whole is still governed by the keyspace QoS value.
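A small sketch of the two composition modes (names are hypothetical; the default of 1 for a missing QoS-Request is an assumption):

```rust
/// Illustrative composition of QoS values: a table-level value replaces the
/// key-space value, while an application's QoS-Request only ranks requests
/// inside the key space's own share.
struct EffectiveQos {
    /// Compared against other region groups on the node.
    group_qos: i64,
    /// Compared only against other requests of the same application.
    request_qos: i64,
}

fn compose(keyspace_qos: i64, table_qos: Option<i64>, qos_request: Option<i64>) -> EffectiveQos {
    const DEFAULT_REQUEST_QOS: i64 = 1; // assumed default when no QoS-Request is sent
    EffectiveQos {
        // Inner override: a table setting replaces the key-space setting.
        group_qos: table_qos.unwrap_or(keyspace_qos),
        // Inner prioritization: a QoS-Request only divides the group's share.
        request_qos: qos_request.unwrap_or(DEFAULT_REQUEST_QOS),
    }
}
```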
#### QoS Policy stored in PD

A QoS policy is set by an administrator in PD. It is a combination of a region group and a QoS value. The main region group is a key space. Smaller groups within a key space, such as a table, may be specified, and their QoS setting takes precedence over that of the key space. These groups are dynamic (new regions can be added) and are translated to regions by PD, which has knowledge of tenant and table groupings.

> **Review comment:** Presumably when a region splits, it inherits the QoS parameters from its parents. What happens when two regions with different QoS are merged? Does PD have knowledge of how tables/tenants are represented within a key space? My assumption is that only TiDB knows this.
>
> **Author reply:** I am a little fuzzy on this detail, but I know we can now prevent tables from sharing regions, so it should be possible for PD to know this.
>
> **Review comment:** Perhaps QoS policies should bind to a range instead of some regions, just like placement rules.
>
> **Author reply:** This proposal doesn't closely specify how regions will be grouped. Grouping by key range could be a great solution. The problem with this approach is that a region could span multiple key ranges. We could try to use key range as the underlying primitive but also try to have APIs that talk in terms of key spaces. We can also reject key ranges that don't fully enclose their existing regions. It is noted in placement rules that the key range of a table can change due to DDL commands. So I am thinking that for the first version of QoS, PD can understand key spaces but won't understand tables and may need to accept key ranges.

A default QoS request setting may be provided for applications that send a QoS value per request; see the QoS Request section.

These QoS policies must be periodically (perhaps once a minute) communicated to TiKV.

A first implementation can assume that all regions have the same QoS.
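A sketch of what such a stored policy record might contain, per the description above (field names are illustrative and not a PD API):

```rust
/// Sketch of a QoS policy record in PD, following the description above.
struct QosPolicyRecord {
    /// The region group the policy applies to: a key space (tenant) or,
    /// with higher precedence, a table inside a key space.
    group: RegionGroup,
    /// Relative priority of the group on the linear QoS scale.
    qos: i64,
    /// Optional default for applications that send a QoS-Request per request.
    default_request_qos: Option<i64>,
}

enum RegionGroup {
    KeySpace { name: String },
    Table { key_space: String, table: String },
}
```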
#### QoS Request: Custom Application Policies

In TiDB we would like to attach a QoS to a user, a role, or some other TiDB-specific object. These application-specific policies should remain in TiDB rather than being pushed down to PD.

The application will already have a QoS relative to other applications based on the number of regions in its key space and the QoS setting. Application-specific policies allow queries within the same QoS to be prioritized differently.

Custom application policies are sent to TiKV by setting a “QoS-Request” field. The QoS request is relative to other requests using the same QoS Key, and the value is not compared to the region-based QoS that it is dividing.

A default value for QoS-Request may be set in PD as part of the QoS policy for a region group. Otherwise a default value is assumed.

In a delegated authentication setup, the QoS-Request field should be received in a signed auth token. This proves the QoS was negotiated with the application owner.
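The shape of this dynamic per-request information, sketched as a plain struct (this is not the actual kvproto/gRPC definition, only an illustration of the idea):

```rust
/// Illustration of the per-request QoS information described above.
struct RequestQosContext {
    /// Identifies which application's share the request is ranked within
    /// (e.g. the key space), so that QoS-Request values from different
    /// applications are never compared with each other.
    qos_key: String,
    /// The application-chosen priority relative to its own other requests.
    qos_request: i64,
    /// In a delegated-auth setup this value would arrive inside a signed
    /// token proving it was negotiated with the application owner.
    signed_token: Option<Vec<u8>>,
}
```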
#### Policy Application

When a node approaches full utilization, it inhibits queries by prioritizing them based on QoS policy. The QoS policy is a combination of recorded settings in PD and dynamic QoS requests that determine the relative QoS share of different region groups.

The QoS share of a region group is specified by the QoS policy in PD for that group, or by a low-priority request QoS setting. That QoS is multiplied by the total number of regions in the group on the node, so that larger groups are assigned more capacity.

Queries of groups that are being inhibited are then prioritized according to their QoS request value (if these values are sent).
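A rough sketch of the share computation described above (names and units are illustrative):

```rust
use std::collections::HashMap;

/// The capacity slice of a region group on one node is proportional to
/// (group QoS x number of the group's regions on that node).
fn capacity_slices(
    node_capacity: u64,
    group_qos: &HashMap<String, u64>,       // QoS value per region group
    regions_on_node: &HashMap<String, u64>, // region count per group on this node
) -> HashMap<String, u64> {
    // Weight of each group on this node: QoS value x number of its regions here.
    let weights: HashMap<&String, u64> = group_qos
        .iter()
        .map(|(group, qos)| {
            (group, qos * regions_on_node.get(group).copied().unwrap_or(0))
        })
        .collect();
    let total: u64 = weights.values().sum::<u64>().max(1);
    // Each group's slice of the node's capacity is proportional to its weight.
    weights
        .into_iter()
        .map(|(group, w)| (group.clone(), node_capacity * w / total))
        .collect()
}
```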
## Global Fairness

### Dynamic QoS adjustment by PD

Instead of altering physical data placement for fairness, PD can dynamically adjust the QoS value of regions.
PD can tell a TiKV node that the QoS for a hot region is larger, to make up for cold regions not utilizing capacity on another node.
When this can be used, we can think of it as achieving the same effect as placement without having to move the data.
This approach is explained and measured in [this paper](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-215.pdf) as adjusting the local per-tenant weight (QoS).
Data placement happens on a long-term time scale, whereas dynamic adjustment can happen every few seconds.
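One way to read this (a rough sketch only; the cited paper's exact formulation differs and nothing here is an existing PD API) is that PD redistributes a tenant's weight across nodes in proportion to where its demand actually is, keeping the tenant's total weight unchanged:

```rust
/// Scale a tenant's base QoS on one node by how concentrated its demand is
/// there, so hot regions get extra local weight while the tenant's summed
/// weight across all nodes stays the same. Purely illustrative.
fn adjusted_local_qos(
    base_qos: f64,
    demand_on_node: f64, // e.g. recent request cost of the tenant on this node
    demand_total: f64,   // the tenant's recent request cost across all nodes
    node_count: f64,
) -> f64 {
    if demand_total <= 0.0 || node_count <= 0.0 {
        return base_qos;
    }
    // Ratio of this node's share of the tenant's demand to a uniform share.
    let concentration = demand_on_node / (demand_total / node_count);
    base_qos * concentration
}
```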
### Adjusted Follower Read

The proposal assumes a single leader.
When using follower read, we should take it into account in our QoS weighting, similar to the replica selection outlined in [this paper](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-215.pdf).
### PD Placement for multi-tenancy

This QoS solution is expected to perform poorly in the following scenario:
* multi-tenant, where a tenant has few regions on a node
* a user has hot regions on one node and cold regions on another

Here the user will not get to share capacity between their hot and cold regions.
We can solve this with group-based node placement.
A group consists of the regions of a single key space (tenant), balanced between hot and cold regions.
If multiple hotspot regions are in the same group, we should balance these regions to other groups.

Leaders of regions in the same group are placed on the same node. For a small user with just one region group, this placement reduces the likelihood of a small availability incident occurring but greatly increases the probability of a large availability incident, which is undesirable. For a large user with many groups, the overall availability may not change.
See this [PD GitHub issue](https://github.com/tikv/pd/issues/2950).
## Drawbacks

* No global perspective
  * Because different queries operate on different nodes, a query with a lower QoS request may effectively be prioritized above a query with a higher QoS.
  * Tenants will experience degraded QoS due to tenant conflict in some cases, but this can be mitigated by rebalancing.
* No integration with a resource scheduler
* No ability to stop queries once started
## Alternatives

Static quota enforcement. Users may prefer to communicate about QoS in terms of quota guarantees. However, static quotas can be inferred from QoS: a quota is the division of capacity according to the QoS settings.
Bursting is important for high utilization. With QoS it is clear what should happen with bursting; with quotas there must be some assumptions about priority.
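As an illustration with made-up numbers: if two key spaces on a node have QoS-weighted shares of 2 and 1, the implied static quotas are 2/3 and 1/3 of the node's capacity. Under QoS, either key space may burst past its implied quota whenever the other is idle; under static quotas, that spare capacity would sit unused unless additional priority assumptions were layered on.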
## Unresolved questions

The exact way to communicate that queries are rejected has not been specified yet. Clients should be able to recognize that they are being rejected due to overloading of the server.

The work required in the scheduler to allow for fair usage of resources and to stop queries that are using too many resources is unknown. Overall work will be limited for this proposal, and improving resource scheduling will continue as an independent long-term project.
> **Review comment:** Presumably TiFlash would work in the same way as TiKV.
>
> **Author reply:** I haven't thought about TiFlash. Managing QoS for the OLTP workload path is critical; for OLAP it is less important. TiFlash is also gaining some direct write support, but I have no idea how that works. As the MPP support for TiFlash improves, it will be easier to handle TiFlash load by scaling out. Additionally, applications that benefit from TiFlash would generally be big enough to have their own TiDB cluster. This proposal will most benefit the smaller applications that must use a shared TiDB cluster until they grow larger.