This document describes recommendations for performance tuning of the Diego Data Store.
- Component scaling guidelines
- BBS Performance Tuning
- Locket Performance Tuning
- SQL Performance Tuning
- Compensating for Envoy memory overhead
The following components must be scaled vertically (more CPU cores and/or memory). Scaling them horizontally does not make sense, since only one instance is active at any given point in time:

- `auctioneer`
- `bbs`
- `route_emitter` (only when running in global mode, as opposed to cell-local mode)
The following components can be scaled horizontally as well as vertically:

- `file_server`
- `locket`
- `rep`
- `rep_windows`
- `route_emitter` (only when running in cell-local mode)
- `route_emitter_windows` (only when running in cell-local mode)
- `ssh_proxy`
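In a cf-deployment-based cluster, horizontal scaling amounts to raising the instance count of the instance group that hosts the component. As a sketch, the ops file below scales out the Diego cells (the `rep` job runs on the `diego-cell` instance group in cf-deployment); the instance count is an illustrative value, not a recommendation:

```yaml
# Illustrative cf-deployment ops file: add more Diego cells.
# The count below is an example, not a sizing recommendation.
- type: replace
  path: /instance_groups/name=diego-cell/instances
  value: 100
```

Vertical scaling works the same way, replacing the instance group's `vm_type` with a larger one from the cloud config.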
The following jobs require more considered planning:

- `bbs`:
  - It is NOT recommended to use burstable performance VMs, such as AWS `t2`-family instances.
  - The performance of the BBS depends significantly on the performance of its SQL database. A less performant SQL backend could reduce the throughput and increase the latency of BBS requests.
  - The BBS activity from API request load and from internal activity are both directly proportional to the total number of running app instances (or running ActualLRPs, in pure Diego terms). If the number of instances that the deployment supports increases without a corresponding increase in VM resources, BBS API response times may increase.
- `rep`:
  - Although the `rep` is a horizontally scalable component, the resources available to each `rep` on its VM (typically called a "Diego cell") limit the total number of app-instance and task containers that can run on that VM. For example, if the `rep` is running on a VM with 20 GB of memory, it can run only 20 app instances that each have a 1 GB memory limit. This constraint also applies to available disk capacity.
  - If it is not possible for an operator to deploy larger cell VMs or to increase the number of cell VMs, the operator can overcommit memory and disk by setting the following properties on the `rep` job:
    - `diego.executor.memory_capacity_mb`
    - `diego.executor.disk_capacity_mb`

    Operators that overcommit cell capacity should be extremely careful not to run out of physical memory or disk capacity on the cells.
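As a sketch, overcommitting could be expressed as a cf-deployment ops file like the following; the capacity values are illustrative assumptions, not recommendations:

```yaml
# Illustrative ops file: advertise more memory and disk capacity than the
# cell VM physically has, overcommitting capacity on the rep job.
- type: replace
  path: /instance_groups/name=diego-cell/jobs/name=rep/properties/diego/executor/memory_capacity_mb?
  value: 32768   # example: advertise 32 GB of memory
- type: replace
  path: /instance_groups/name=diego-cell/jobs/name=rep/properties/diego/executor/disk_capacity_mb?
  value: 131072  # example: advertise 128 GB of disk
```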
- `locket`:
  - It is NOT recommended to use burstable performance VMs, such as AWS `t2`-family instances.
  - The performance of the Locket instances depends significantly on the performance of its SQL database. A less performant SQL backend could reduce the throughput and increase the latency of Locket requests, which may in turn affect the availability of services such as the BBS, the auctioneer, and the cell reps that maintain locks and presences in Locket.
  - Note: Although `locket` is a horizontally scalable job, in cf-deployment it is deployed on the `diego-api` instance group along with the `bbs` job. In that case we still recommend scaling the instance group vertically.
The Diego team currently benchmarks the BBS and Locket together on a VM with 16 CPU cores and 60 GB of memory. The MySQL and Postgres backends have the same number of cores and memory. This setup can handle load from 1000 simulated cells (running `rep` and `route-emitter`) with a total of 250K LRPs.
The maximum number of connections from the active BBS to the SQL database can be set using the `diego.bbs.sql.max_open_connections` property on the `bbs` job, and the maximum number of idle connections can be set using `diego.bbs.sql.max_idle_connections`. By default, `diego.bbs.sql.max_idle_connections` is set to the same value as `diego.bbs.sql.max_open_connections` to avoid recreating connections to the database unnecessarily.
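In cf-deployment these properties could be set with an ops file along these lines; the connection counts are illustrative values, not recommendations:

```yaml
# Illustrative ops file: raise the BBS connection limits to the SQL backend.
- type: replace
  path: /instance_groups/name=diego-api/jobs/name=bbs/properties/diego/bbs/sql/max_open_connections?
  value: 500
- type: replace
  path: /instance_groups/name=diego-api/jobs/name=bbs/properties/diego/bbs/sql/max_idle_connections?
  value: 500  # kept equal to max_open_connections, matching the default behavior
```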
The maximum number of connections from each Locket instance to the database can be set using the `database.max_open_connections` property on the `locket` job. Unlike the BBS, the Locket job does not permit the maximum number of idle connections to be set independently, and always sets it to the same value as `database.max_open_connections`.
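The corresponding ops-file sketch for Locket (again with an illustrative value):

```yaml
# Illustrative ops file: raise the Locket connection limit to the SQL backend.
# Locket has no separate idle-connection property; it pins the idle limit
# to the same value as database.max_open_connections.
- type: replace
  path: /instance_groups/name=diego-api/jobs/name=locket/properties/database/max_open_connections?
  value: 200
```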
In a cf-deployment-based CF cluster, an operator can calculate the maximum number of connections from the Diego components (BBS and Locket) to the SQL backend using the following formula:

    <diego.bbs.sql.max_open_connections> + <database.max_open_connections> * <number of diego-api instances>

- The `diego.bbs.sql.max_open_connections` parameter contributes only once because there is only one active BBS instance.
- The actual number of active connections may be significantly lower than this maximum, depending on the scale of the app workload that the CF cluster supports.
- If other components connect to the same SQL database, you will need to add their maximum numbers of connections to get an accurate figure.
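As a worked example with hypothetical values (500 maximum open BBS connections, 200 per Locket instance, and 2 `diego-api` instances):

```
500 + 200 * 2 = 900 maximum connections to the SQL backend
```

Any database connection limit (for example, MySQL's `max_connections`) would need headroom above this figure, plus the maximums of any other components sharing the database.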
Operators can use the following cf-deployment-compatible operations files to tune their MySQL or Postgres databases to support a large CF cluster:

- MySQL: `mysql.yml`
- Postgres: `postgres.yml`
These operations files are the ones used in the Diego team's 250K-instance benchmark tests, and operators may freely change the sizing and scaling parameters in them to match the resource needs of their own CF clusters.