The Orchestrator uses an internal queue to manage instances for discovery. When an instance is ready for discovery, it gets added to the queue. Discovery workers process the queue. The `DiscoveryMaxConcurrency` setting in a configuration file controls the number of workers. This setting determines how many discoveries can happen in parallel.

The Orchestrator uses this mechanism to periodically monitor all instances. The `InstancePollSeconds` configuration parameter sets how often the Orchestrator refreshes the information.

When there are many inaccessible or unhealthy instances, the Orchestrator may lose an accurate view of the cluster and be late with needed recovery actions. Discoveries of such instances may take a long time and fail anyway, while occupying workers from the discovery worker pool. Meanwhile, healthy instances wait in the queue and are not checked in a timely manner.

To avoid this, the Orchestrator can be configured to maintain a separate discovery queue for unhealthy instances. This queue is processed by a separate pool of workers. Additionally, an exponential backoff can be applied when rechecking such instances.

Configuration example:
```json
{
  "DeadInstanceDiscoveryMaxConcurrency": 100,
  "DeadInstancePollSecondsMultiplyFactor": 1.5,
  "DeadInstancePollSecondsMax": 60,
  "DeadInstanceDiscoveryLogsEnabled": true
}
```
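
As a rough illustration of how such a fragment maps onto Go types, the sketch below decodes it with `encoding/json`. The `deadInstanceConfig` struct and `parseDeadInstanceConfig` function are hypothetical names introduced here. The field names and types follow the `Configuration` struct in `go/config/config.go`, and the defaults and the `>= 1` check are taken from the parameter descriptions in this document.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// deadInstanceConfig mirrors only the four dead-instance settings.
// It is an illustrative subset, not the Orchestrator's real Configuration type.
type deadInstanceConfig struct {
	DeadInstanceDiscoveryMaxConcurrency   uint
	DeadInstancePollSecondsMultiplyFactor float32
	DeadInstancePollSecondsMax            uint
	DeadInstanceDiscoveryLogsEnabled      bool
}

// parseDeadInstanceConfig decodes raw JSON on top of the documented defaults.
func parseDeadInstanceConfig(raw []byte) (deadInstanceConfig, error) {
	cfg := deadInstanceConfig{
		DeadInstancePollSecondsMultiplyFactor: 1,   // documented default
		DeadInstancePollSecondsMax:            300, // documented default
	}
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, err
	}
	if cfg.DeadInstancePollSecondsMultiplyFactor < 1 {
		return cfg, errors.New("DeadInstancePollSecondsMultiplyFactor must be >= 1")
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{
	  "DeadInstanceDiscoveryMaxConcurrency": 100,
	  "DeadInstancePollSecondsMultiplyFactor": 1.5,
	  "DeadInstancePollSecondsMax": 60,
	  "DeadInstanceDiscoveryLogsEnabled": true
	}`)
	cfg, err := parseDeadInstanceConfig(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```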

`DeadInstanceDiscoveryMaxConcurrency` (default: 0) - Determines the number of discovery workers dedicated to dead instances. If this pool size is greater than 0, the Orchestrator maintains a separate queue for dead instances.

`DeadInstancePollSecondsMultiplyFactor` (default: 1) - Floating-point number; allowed values are >= 1. Determines how aggressive the backoff mechanism is. By default, when `DeadInstancePollSecondsMultiplyFactor = 1`, the instance is checked every `InstancePollSeconds` seconds. If the value is greater than 1, each consecutive attempt `n` is made after a period calculated according to the formula:

Note that `DeadInstanceDiscoveryMaxConcurrency` controls whether the separate pool of discovery workers is created, but it has no impact on the backoff mechanism controlled by `DeadInstancePollSecondsMultiplyFactor`. This has the following implications:

1. `DeadInstanceDiscoveryMaxConcurrency > 0` and `DeadInstancePollSecondsMultiplyFactor > 1`:\
   A separate discovery queue for dead instances is created, dead instances are checked by a dedicated pool of worker goroutines, and rechecks follow the exponential backoff mechanism.
2. `DeadInstanceDiscoveryMaxConcurrency = 0` and `DeadInstancePollSecondsMultiplyFactor > 1`:\
   No separate discovery queue for dead instances is created, and dead instances are checked by the same pool of worker goroutines as healthy instances. However, the exponential backoff mechanism is applied to dead instances.
3. `DeadInstanceDiscoveryMaxConcurrency > 0` and `DeadInstancePollSecondsMultiplyFactor = 1`:\
   A separate discovery queue for dead instances is created, and dead instances are checked by a dedicated pool of worker goroutines. No exponential backoff mechanism is applied to dead instances.
4. `DeadInstanceDiscoveryMaxConcurrency = 0` and `DeadInstancePollSecondsMultiplyFactor = 1`:\
   There is no separate discovery queue for dead instances, no dedicated worker goroutines, and no backoff mechanism. This is the default working mode.
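
The four combinations above reduce to two independent switches, which the following sketch makes explicit. `discoveryMode` and `modeFor` are hypothetical names used only for this illustration; they are not part of the Orchestrator code base.

```go
package main

import "fmt"

// discoveryMode describes how dead instances are handled for a pair of settings.
type discoveryMode struct {
	SeparateQueue bool // dedicated queue and worker pool for dead instances
	Backoff       bool // exponential backoff between rechecks of dead instances
}

// modeFor derives the working mode from the two configuration values.
// Each setting controls exactly one switch; neither influences the other.
func modeFor(maxConcurrency uint, multiplyFactor float32) discoveryMode {
	return discoveryMode{
		SeparateQueue: maxConcurrency > 0,
		Backoff:       multiplyFactor > 1,
	}
}

func main() {
	fmt.Printf("%+v\n", modeFor(100, 1.5)) // case 1
	fmt.Printf("%+v\n", modeFor(0, 1.5))   // case 2
	fmt.Printf("%+v\n", modeFor(100, 1))   // case 3
	fmt.Printf("%+v\n", modeFor(0, 1))     // case 4, the default
}
```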

`DeadInstancePollSecondsMax` (default: 300) - Controls the maximum delay for the backoff mechanism. If the calculated backoff exceeds this value, it is considered saturated and stays at `DeadInstancePollSecondsMax`.
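
To illustrate how a multiply factor and a cap interact, here is a minimal sketch of a capped exponential backoff. The exact formula used by the Orchestrator is not reproduced here. The geometric form `pollSeconds * factor^(n-1)` is an assumption chosen so that a factor of 1 yields a constant `InstancePollSeconds` interval, as described above. `nextDelaySeconds` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"math"
)

// nextDelaySeconds returns the wait before the n-th consecutive recheck (n >= 1).
// ASSUMPTION: geometric growth pollSeconds * factor^(n-1); the real formula may differ.
// The result saturates at maxSeconds, mirroring DeadInstancePollSecondsMax.
func nextDelaySeconds(pollSeconds uint, factor float64, maxSeconds uint, n int) float64 {
	if n < 1 {
		n = 1
	}
	delay := float64(pollSeconds) * math.Pow(factor, float64(n-1))
	return math.Min(delay, float64(maxSeconds))
}

func main() {
	// Example values only: a 5-second poll interval, factor 1.5, cap at 60 seconds.
	for n := 1; n <= 8; n++ {
		fmt.Printf("try %d: %.2fs\n", n, nextDelaySeconds(5, 1.5, 60, n))
	}
}
```

Under this assumption the delay grows with each failed attempt until it reaches the cap, after which every further recheck waits the same maximum time.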

## Diagnostics
Orchestrator provides the `debug/metrics` web endpoint for diagnostics.

`discoveries.dead_instances` - provides the number of instances currently registered as dead.\
`discoveries.dead_instances_queue_length` - provides the current length of the queue dedicated to dead instances. Note that this is meaningful only when `DeadInstanceDiscoveryMaxConcurrency > 0`, that is, when a separate queue is used. In other cases it is always zero.

Other diagnostic endpoints:

`api/discovery-queue-metrics-raw/:seconds` - provides the raw metrics over the given time window for the `DEFAULT` discovery queue.\
`api/discovery-queue-metrics-raw/:queue/:seconds` - provides the raw metrics over the given time window for the supplied (`DEFAULT` or `DEADINSTANCES`) discovery queue.\
`api/discovery-queue-metrics-aggregated/:seconds` - provides aggregated metrics over the given time window for the `DEFAULT` discovery queue.\
`api/discovery-queue-metrics-aggregated/:queue/:seconds` - provides aggregated metrics over the given time window for the supplied (`DEFAULT` or `DEADINSTANCES`) discovery queue.

Note that the `DEADINSTANCES` queue is available only if `DeadInstanceDiscoveryMaxConcurrency > 0`.
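
A small helper for building the per-queue endpoint paths might look like the sketch below. `queueMetricsPath` is a hypothetical function, not part of the Orchestrator. It assumes that both the raw and the aggregated endpoints live under the `api/` prefix, and it takes the configured dead-instance pool size as an argument so it can reject the `DEADINSTANCES` queue when that queue does not exist.

```go
package main

import (
	"errors"
	"fmt"
)

// queueMetricsPath builds the path of a per-queue metrics endpoint.
// kind is "raw" or "aggregated"; queue is "DEFAULT" or "DEADINSTANCES".
// deadPoolSize is the configured DeadInstanceDiscoveryMaxConcurrency.
// ASSUMPTION: both endpoint families share the "api/" prefix.
func queueMetricsPath(kind, queue string, seconds int, deadPoolSize uint) (string, error) {
	if kind != "raw" && kind != "aggregated" {
		return "", errors.New("kind must be raw or aggregated")
	}
	if queue != "DEFAULT" && queue != "DEADINSTANCES" {
		return "", errors.New("queue must be DEFAULT or DEADINSTANCES")
	}
	if queue == "DEADINSTANCES" && deadPoolSize == 0 {
		return "", errors.New("DEADINSTANCES queue requires DeadInstanceDiscoveryMaxConcurrency > 0")
	}
	return fmt.Sprintf("api/discovery-queue-metrics-%s/%s/%d", kind, queue, seconds), nil
}

func main() {
	path, err := queueMetricsPath("raw", "DEADINSTANCES", 60, 100)
	if err != nil {
		panic(err)
	}
	fmt.Println(path)
}
```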

## Logging
Logging of the dead instance discovery process is controlled via the `DeadInstanceDiscoveryLogsEnabled` boolean parameter. It is disabled by default.

go/config/config.go
15 lines changed: 15 additions & 0 deletions
@@ -142,6 +142,10 @@ type Configuration struct {
     DiscoverByShowSlaveHosts              bool    // Attempt SHOW SLAVE HOSTS before PROCESSLIST
     UseSuperReadOnly                      bool    // Should orchestrator super_read_only any time it sets read_only
     InstancePollSeconds                   uint    // Number of seconds between instance reads
+    DeadInstancePollSecondsMultiplyFactor float32 // InstancePollSeconds increase factor for dead instances read time calculation
+    DeadInstancePollSecondsMax            uint    // Maximum delay between dead instance read attempts
+    DeadInstanceDiscoveryMaxConcurrency   uint    // Number of goroutines doing dead hosts discovery
+    DeadInstanceDiscoveryLogsEnabled      bool    // Enable logs related to dead instances discoveries
     ReasonableInstanceCheckSeconds        uint    // Number of seconds an instance read is allowed to take before it is considered invalid, i.e. before LastCheckValid will be false
     InstanceWriteBufferSize               int     // Instance write buffer size (max number of instances to flush in one INSERT ODKU)
     BufferInstanceWrites                  bool    // Set to 'true' for write-optimization on backend table (compromise: writes can be stale and overwrite non stale data)