docs/WAF: failure zones refresh v1.15 manual backport (#21545) #21653

111 changes: 77 additions & 34 deletions website/content/docs/architecture/improving-consul-resilience.mdx
@@ -5,21 +5,19 @@ description: >-
Fault tolerance is a system's ability to operate without interruption despite component failure. Learn how a set of Consul servers provide fault tolerance through use of a quorum, and how to further improve control plane resilience through use of infrastructure zones and Enterprise redundancy zones.
---

# Fault Tolerance
# Fault tolerance

Fault tolerance is the ability of a system to continue operating without interruption
despite the failure of one or more components.
The most basic production deployment of Consul has 3 server agents and can lose a single
server without interruption.

As you continue to use Consul, your circumstances may change.
Perhaps a datacenter becomes more business critical or risk management policies change,
necessitating an increase in fault tolerance.
The sections below discuss options for how to improve Consul's fault tolerance.
You must give careful consideration to reliability in the architecture frameworks that you build. A resilient platform minimizes the remediation actions you need to take when a failure occurs. This document describes how to design and operate a resilient Consul cluster, including the methods and features that support this goal.

Consul has many features, operating both locally and remotely, that can help you offer a resilient service across multiple datacenters.

## Fault Tolerance in Consul

Consul's fault tolerance is determined by the configuration of its voting server agents.
## Introduction

Fault tolerance is the ability of a system to continue operating without interruption
despite the failure of one or more components. In Consul, the number of server agents determines the fault tolerance.


Each Consul datacenter depends on a set of Consul voting server agents.
The voting servers ensure Consul has a consistent, fault-tolerant state
@@ -42,28 +40,25 @@ number of servers, quorum, and fault tolerance, refer to the
[consensus protocol documentation](/consul/docs/architecture/consensus#deployment_table).
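
The relationship between the number of voting servers, quorum, and fault tolerance can be sketched with simple majority arithmetic. The following is a minimal illustration, not Consul code; the function names `quorum` and `fault_tolerance` are hypothetical and chosen only for this example.

```python
def quorum(voters: int) -> int:
    """Return the smallest majority of `voters` voting servers."""
    if voters < 1:
        raise ValueError("a datacenter needs at least one voting server")
    return voters // 2 + 1


def fault_tolerance(voters: int) -> int:
    """Return how many voting servers can fail while a majority remains."""
    return voters - quorum(voters)


if __name__ == "__main__":
    # Print a small table for 1 through 7 voting servers.
    print("servers  quorum  fault tolerance")
    for n in range(1, 8):
        print(f"{n:>7}  {quorum(n):>6}  {fault_tolerance(n):>15}")
```

Note that an even number of voters raises the quorum without raising the fault tolerance: 4 voters tolerate the same single failure as 3.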

Effectively mitigating your risk is more nuanced than just increasing the fault tolerance
metric described above. You must consider:

### Correlated Risks
because the infrastructure costs can outweigh the improved resiliency. You must also consider correlated risks at the infrastructure level. An infrastructure-level failure can cause multiple servers to fail at the same time, which means that a single failure could cause a Consul outage even if your server-level fault tolerance is 2.

Are you protected against correlated risks? Infrastructure-level failures can cause multiple servers to fail at the same time. This means that a single infrastructure-level failure could cause a Consul outage, even if your server-level fault tolerance is 2.
Different options for your resilient datacenter present trade-offs between operational complexity, computing cost, and Consul request performance. Consider these factors when designing your resilient architecture.

### Mitigation Costs
## Fault tolerance

What are the costs of the mitigation? Different mitigation options present different trade-offs for operational complexity, computing cost, and Consul request performance.
The following sections explore several options for increasing Consul's fault tolerance. For enhanced reliability, we recommend taking a holistic approach by layering these features together.

## Strategies to Increase Fault Tolerance
- Spread servers across infrastructure [availability zones](#availability-zones).
- Use a [minimum quorum size](#quorum-size) to avoid performance impacts.
- Use [redundancy zones](#redundancy-zones) to improve fault tolerance. <EnterpriseAlert inline />
- Use [Autopilot](#autopilot) to automatically prune failed servers and maintain quorum size.
- Use [cluster peering](#cluster-peering) to provide service redundancy.

The following sections explore several options for increasing Consul's fault tolerance.
### Availability zones

HashiCorp recommends all production deployments consider:
- [Spreading Consul servers across availability zones](#spread-servers-across-infrastructure-availability-zones)
- <EnterpriseAlert inline /><a href="#use-backup-voting-servers-to-replace-lost-voters">Using backup voting servers to replace lost voters</a>

### Spread Servers Across Infrastructure Availability Zones
The cloud or on-premise infrastructure underlying your [Consul datacenter](/consul/docs/install/glossary#datacenter) can run across multiple availability zones.

The cloud or on-premise infrastructure underlying your [Consul datacenter](/consul/docs/install/glossary#datacenter)
may be split into several "availability zones".
An availability zone is meant to share no points of failure with other zones by:
- Having power, cooling, and networking systems independent from other zones
- Being physically distant enough from other zones so that large-scale disruptions
@@ -79,25 +74,25 @@ To distribute your Consul servers across availability zones, modify your infrast

Additionally, you should leverage resources that can automatically restore your compute instance,
such as autoscaling groups, virtual machine scale sets, or compute engine autoscaler.
The autoscaling resources can be customized to re-deploy servers into specific availability zones
and ensure the desired numbers of servers are available at all time.
Customize autoscaling resources to re-deploy servers into specific availability zones and ensure that the desired number of servers is available at all times.
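
To make the effect of zone placement concrete, the sketch below assigns servers to zones round-robin and checks whether a majority of servers survives the loss of any single zone. It is an illustration under simplifying assumptions (every server is a voter, and a zone failure takes down all servers in that zone); the names `spread` and `survives_zone_loss` and the `server-N` naming are hypothetical.

```python
from collections import Counter
from itertools import cycle


def spread(server_count: int, zones: list[str]) -> dict[str, str]:
    """Assign servers to zones round-robin so zone sizes differ by at most one."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return {f"server-{i}": next(zone_cycle) for i in range(1, server_count + 1)}


def survives_zone_loss(placement: dict[str, str]) -> bool:
    """Return True if losing any single zone still leaves a majority of servers."""
    total = len(placement)
    majority = total // 2 + 1
    per_zone = Counter(placement.values())
    return all(total - lost >= majority for lost in per_zone.values())


if __name__ == "__main__":
    three_zones = spread(5, ["zone-a", "zone-b", "zone-c"])
    two_zones = spread(5, ["zone-a", "zone-b"])
    print(survives_zone_loss(three_zones))
    print(survives_zone_loss(two_zones))
```

With five servers, three zones give a 2/2/1 split, so any one zone can fail; two zones give a 3/2 split, so losing the larger zone also loses the majority.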

### Add More Voting Servers
### Quorum size

For most production use cases, we recommend using either 3 or 5 voting servers,
For most production use cases, we recommend keeping the quorum small by using either 3 or 5 voting servers (a quorum of 2 or 3),
yielding a server-level fault tolerance of 1 or 2 respectively.

Even though it would improve fault tolerance,
adding voting servers beyond 5 is **not recommended** because it decreases Consul's performance—
it requires Consul to involve more servers in every state change or consistent read.

Consul Enterprise provides a way to improve fault tolerance without this performance penalty:
[using backup voting servers to replace lost voters](#use-backup-voting-servers-to-replace-lost-voters).
Consul Enterprise users can use redundancy zones to improve fault tolerance without this performance penalty.

### Redundancy zones <EnterpriseAlert inline />

### <EnterpriseAlert inline /> Use Backup Voting Servers to Replace Lost Voters
Use Consul Enterprise [redundancy zones](/consul/docs/enterprise/redundancy) to improve fault tolerance without the performance penalty of increasing the number of voting servers.

Consul Enterprise [redundancy zones](/consul/docs/enterprise/redundancy)
can be used to improve fault tolerance without the performance penalty of increasing the number of voting servers.
![Reference architecture diagram for Consul Redundancy zones](/img/architecture/consul-redundancy-zones-light.png#light-theme-only)
![Reference architecture diagram for Consul Redundancy zones](/img/architecture/consul-redundancy-zones-dark.png#dark-theme-only)

Each redundancy zone should be assigned 2 or more Consul servers.
If all servers are healthy, only one server per redundancy zone will be an active voter;
@@ -132,3 +127,51 @@ For more information on redundancy zones, refer to:
for a more detailed explanation
- [Redundancy zone tutorial](/consul/tutorials/enterprise/redundancy-zones)
to learn how to use them
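
The one-voter-per-zone rule described in this section can be sketched as follows. This is a simplified model, not how Autopilot is implemented: the `Server` class, the `pick_voters` function, and the "first healthy server listed wins" rule are assumptions made for illustration, and the sketch models nothing beyond selecting one healthy voter per zone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    name: str
    zone: str
    healthy: bool = True


def pick_voters(servers: list[Server]) -> dict[str, str]:
    """Choose at most one healthy server per zone as that zone's voter."""
    voters: dict[str, str] = {}
    for server in servers:
        # The first healthy server seen in a zone becomes its voter.
        if server.healthy and server.zone not in voters:
            voters[server.zone] = server.name
    return voters


if __name__ == "__main__":
    cluster = [
        Server("a1", "zone-a"), Server("a2", "zone-a"),
        Server("b1", "zone-b"), Server("b2", "zone-b"),
        Server("c1", "zone-c"), Server("c2", "zone-c"),
    ]
    print(pick_voters(cluster))

    # If a1 fails, the backup in the same zone takes over as voter.
    cluster[0] = Server("a1", "zone-a", healthy=False)
    print(pick_voters(cluster))
```

Six servers yield only three voters, so the quorum stays at two even though the cluster holds a spare for each zone.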

### Autopilot

Autopilot is a set of functions that introduce servers to a cluster, clean up dead servers, and monitor the state of the Raft protocol in the Consul cluster.

When you enable Autopilot's dead server cleanup, Autopilot marks failed servers as `Left` and removes them from the Raft peer set to prevent them from interfering with the quorum size. Autopilot does this as soon as a replacement Consul server comes online. This behavior is beneficial when server nodes have failed and been redeployed, but Consul considers them new nodes because their IP addresses and hostnames have changed. Autopilot keeps the cluster peer set size correct and the quorum requirement simple.

To illustrate the Autopilot advantage, consider a scenario where Consul has a cluster of five server nodes. The quorum is three, which means the cluster can lose two server nodes before the cluster fails. The following events happen:

1. Two server nodes fail.
1. Two replacement nodes are deployed with new hostnames and IPs.
1. The two replacement nodes rejoin the Consul cluster.
1. Consul treats the replacement nodes as extra nodes, unrelated to the previously failed nodes.

_With Autopilot not enabled_, the following happens:

1. Consul does not immediately clean up the failed nodes when the replacement nodes join the cluster.
1. The cluster now has the three surviving nodes, the two failed nodes, and the two replacement nodes, for a total of seven nodes.
- The quorum is increased to four, which means the cluster can only afford to lose one node until after the two failed nodes are deleted in seventy-two hours.
- The redundancy level has decreased from its initial state.

_With Autopilot enabled_, the following happens:

1. Consul immediately cleans up the failed nodes when the replacement nodes join the cluster.
1. The cluster now has the three surviving nodes and the two replacement nodes, for a total of five nodes.
- The quorum stays at three, which means the cluster can afford to lose two nodes before it fails.
- The redundancy level remains the same.
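
The arithmetic behind the two outcomes above can be checked with a short sketch. It assumes that failed servers still count toward the Raft peer set until they are removed; the `cluster_state` function and its field names are hypothetical.

```python
def quorum(peers: int) -> int:
    """Return the smallest majority of a peer set of the given size."""
    return peers // 2 + 1


def cluster_state(healthy: int, failed_in_peer_set: int) -> dict[str, int]:
    """Summarize peer set size, quorum, and how many more servers can be lost."""
    peers = healthy + failed_in_peer_set
    required = quorum(peers)
    return {"peers": peers, "quorum": required, "can_lose": healthy - required}


if __name__ == "__main__":
    # In both cases, three survivors plus two replacements are healthy.
    without_cleanup = cluster_state(healthy=5, failed_in_peer_set=2)
    with_cleanup = cluster_state(healthy=5, failed_in_peer_set=0)
    print("cleanup disabled:", without_cleanup)
    print("cleanup enabled: ", with_cleanup)
```

Without cleanup, the two failed servers inflate the peer set to seven and the quorum to four, leaving room for only one more failure. With cleanup, the peer set stays at five and two failures remain tolerable.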

### Cluster peering

Linking multiple Consul clusters together to provide service redundancy is the most effective method to prevent disruption from failure. This method is enhanced when you design individual Consul clusters with resilience in mind. Consul clusters interconnect in two ways: WAN federation and cluster peering. We recommend using cluster peering whenever possible.

Cluster peering lets you connect two or more independent Consul clusters using mesh gateways, so that services can communicate between non-identical partitions in different datacenters.

![Reference architecture diagram for Consul cluster peering](/img/architecture/cluster-peering-diagram-light.png#light-theme-only)
![Reference architecture diagram for Consul cluster peering](/img/architecture/cluster-peering-diagram-dark.png#dark-theme-only)

Cluster peering is the preferred way to interconnect clusters because it is operationally easier to configure and manage than WAN federation. Cluster peering communication between two datacenters runs only on one port on the related Consul mesh gateway, which makes it operationally easy to expose for routing purposes.

When you use cluster peering to connect admin partitions between datacenters, use Consul’s dynamic traffic management configuration entries `service-splitter`, `service-router`, and `service-resolver` to configure your service mesh to automatically forward or fail over service traffic between peer clusters. Consul can then manage the traffic intended for the service and perform [failover](/consul/docs/connect/config-entries/service-resolver#spec-failover), [load-balancing](/consul/docs/connect/config-entries/service-resolver#spec-loadbalancer), or [redirection](/consul/docs/connect/config-entries/service-resolver#spec-redirect).

Cluster peering also extends service discovery across different datacenters independent of service mesh functions. After you peer datacenters, you can refer to services between datacenters with `<service>.virtual.peer.consul` in Consul DNS. For Consul Enterprise, your query string may need to include the namespace, partition, or both. Refer to the [Consul DNS documentation](/consul/docs/services/discovery/dns-static-lookups#service-virtual-ip-lookups) for details on building virtual service lookups.

For more information on cluster peering, refer to:
- [Cluster peering documentation](/consul/docs/connect/cluster-peering)
for a more detailed explanation
- [Cluster peering tutorial](/consul/tutorials/implement-multi-tenancy/cluster-peering)
to learn how to implement cluster peering