Commit c4764cc

Add images to the Disaster recovery page (#2619)

1 parent d828266 commit c4764cc

File tree

8 files changed: +899 -17 lines changed

modules/ROOT/images/disaster.svg
Lines changed: 103 additions & 0 deletions

modules/ROOT/images/fully-recovered-cluster.svg
Lines changed: 97 additions & 0 deletions

modules/ROOT/images/healthy-cluster.svg
Lines changed: 103 additions & 0 deletions

modules/ROOT/images/servers-cordoned-databases-moved.svg
Lines changed: 135 additions & 0 deletions

modules/ROOT/images/servers-cordoned.svg
Lines changed: 135 additions & 0 deletions

modules/ROOT/images/servers-deallocated.svg
Lines changed: 135 additions & 0 deletions

modules/ROOT/images/system-db-restored.svg
Lines changed: 117 additions & 0 deletions

modules/ROOT/pages/clustering/multi-region-deployment/disaster-recovery.adoc
Lines changed: 74 additions & 17 deletions
@@ -19,10 +19,52 @@ You have to create a new cluster and restore the databases, see xref:clustering/
 == Faults in clusters
 
 Databases in clusters may be allocated differently within the cluster and may also have different numbers of primaries and secondaries.
+
+image::healthy-cluster.svg[width="400", title="A healthy cluster", role=popup]
+
 The consequence of this is that all servers may be different in which databases they are hosting.
 Losing a server in a cluster may cause some databases to lose a member while others are unaffected.
 Therefore, in a disaster where one or more servers go down, some databases may keep running with little to no impact, while others may lose all their allocated resources.
 
+Figure 2 shows the disaster when three servers are lost, demonstrating that this situation impacts databases in different ways.
+
+image::disaster.svg[width="400", title="Example of a cluster disaster", role=popup]
+
+.Disaster scenarios and recovery strategies
+[cols="1,2,2", options=header]
+|===
+^|Database
+^|Disaster scenario
+^|Recovery strategy
+
+|Database A
+|All allocations are lost.
+|The database needs to be recreated from a backup since there are no available allocations left in the cluster.
+
+|Database B
+|The primary allocation is lost, and the secondary allocation is available.
+|The database needs to be recreated since it has lost a majority of primary allocations and is therefore write-unavailable.
+However, the recreation can be based on the secondary allocation still present on a healthy server, so a backup is not required.
+The recreated database will be as up-to-date as the secondary allocation was at the time of the disaster.
+
+|Database C
+|Two primary allocations and a secondary one are lost.
+|The database needs to be recreated since it has lost a majority of primary allocations and is therefore write-unavailable.
+However, the recreation can be based on the primary and secondary allocations still present on healthy servers, so a backup is not required.
+The recreated database will reflect the state of the most up-to-date surviving primary or secondary allocation.
+
+|Database D
+|One primary allocation and two secondary allocations are lost.
+|The database remains write-available, allowing it to automatically move allocations from lost servers to available ones when the lost servers are deallocated.
+Therefore, the database does not need to be recreated even though some allocations have been lost.
+
+|Database E
+|Stays unaffected.
+|None of the database's allocations were affected by the disaster, so no action is required.
+|===
+
+Although databases C and D share the same topology, their primaries and secondaries are allocated differently, requiring distinct recovery strategies in this disaster example.
+
 == Guide overview
 [NOTE]
 ====
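The recovery strategies in the table above hinge on which allocations of each database survived. As a hedged sketch (not part of this commit), an operator could inspect the surviving allocations with `cypher-shell`; the address, username, and the exact `YIELD` columns used here are assumptions about a typical Neo4j 5 cluster:

```shell
# Sketch only: list every database allocation with its hosting server,
# role, and status, to decide which recovery strategy from the table applies.
# neo4j://surviving-server:7687 and the credentials are placeholders.
cypher-shell -a neo4j://surviving-server:7687 -u neo4j \
  "SHOW DATABASES YIELD name, serverID, role, currentStatus
   RETURN name, serverID, role, currentStatus ORDER BY name"
```

A database whose every allocation is on a lost server corresponds to the "Database A" row; one that still lists a secondary on a healthy server corresponds to "Database B", and so on.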
@@ -115,9 +157,6 @@ Use the following steps to regain write availability for the `system` database i
 They create a new `system` database from the most up-to-date copy of the `system` database that can be found in the cluster.
 It is important to get a `system` database that is as up-to-date as possible, so it corresponds to the view before the disaster closely.
 
-.Guide
-[%collapsible]
-====
 
 [NOTE]
 =====
@@ -133,6 +172,8 @@ This causes downtime for all databases in the cluster until the processes are st
 . For every _lost_ server, add a new *unconstrained* one according to xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster].
 It is important that the new servers are unconstrained, or deallocating servers in the next step of this guide might be blocked, even though enough servers were added.
 +
+In the current example, the new unconstrained servers are added in this step.
++
 [NOTE]
 =====
 While recommended, it is not strictly necessary to add new servers in this step.
@@ -143,10 +184,12 @@ Be aware that not replacing servers can cause cluster overload when databases ar
 =====
 +
 . On each server, run `bin/neo4j-admin database load system --from-path=[path-to-dump] --overwrite-destination=true` to load the current `system` database dump.
++
+image::system-db-restored.svg[width="400", title="The unconstrained servers are added and the `system` database is restored", role=popup]
++
 . On each server, ensure that the discovery settings are correct.
 See xref:clustering/setup/discovery.adoc[Cluster server discovery] for more information.
 . Start the Neo4j process on all servers.
-====
 
 
 [[make-servers-available]]
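The `neo4j-admin database load` step referenced in the hunk above could look like the following sketch. The dump path is a placeholder (the doc itself leaves it as `[path-to-dump]`), and the sketch assumes the Neo4j process on the server is stopped, as the surrounding guide requires:

```shell
# Sketch only: restore the system database dump on one server.
# /backups/system-dump is a hypothetical path standing in for [path-to-dump].
bin/neo4j-admin database load system \
  --from-path=/backups/system-dump \
  --overwrite-destination=true
```

`--overwrite-destination=true` replaces any existing `system` store, which is what makes every server converge on the same restored view.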
@@ -180,16 +223,18 @@ This is done in two different steps:
 * Any allocations that cannot move by themselves require the database to be recreated so that they are forced to move.
 * Any allocations that can move will be instructed to do so by deallocating the server.
 
-.Guide
-[%collapsible]
-====
+
 . For each `Unavailable` server, run `CALL dbms.cluster.cordonServer("unavailable-server-id")` on one of the available servers.
 This prevents new database allocations from being moved to this server.
-. For each `Cordoned` server, make sure a new *unconstrained* server has been added to the cluster to take its place.
-See xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
 +
-If servers were added in the <<make-the-system-database-write-available, Make the `system` database write-available>> step of this guide, additional servers might not be needed here.
-It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
+image::servers-cordoned.svg[width="400", title="Cordon unavailable servers", role=popup]
++
+Figure 4 shows that new unconstrained servers have been added already.
+It was done in the <<make-the-system-database-write-available, Make the `system` database write-available>> step of this guide, and additional servers might not be needed here.
+
+. If you have not yet added new *unconstrained* servers, add one for each `Cordoned` server that needs to be replaced.
+See xref:clustering/servers.adoc#cluster-add-server[Add a server to the cluster] for more information.
+It is important that the new servers are unconstrained, or deallocating servers might be blocked even though enough servers were added.
 +
 [NOTE]
 =====
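The cordon step in the hunk above can be sketched from a shell; the server address and the server ID are placeholders (real IDs would come from `SHOW SERVERS`), not values from this commit:

```shell
# Sketch only: cordon one Unavailable server from a healthy cluster member,
# so no new database allocations are placed on it.
# The UUID below is a made-up placeholder server ID.
cypher-shell -a neo4j://healthy-server:7687 -u neo4j \
  "CALL dbms.cluster.cordonServer('25a7efc7-d063-44b8-bdee-f23357f89f01')"
```

Repeating this once per `Unavailable` server matches the "for each" wording of the step.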
@@ -229,10 +274,16 @@ If any database has `currentStatus` = `quarantined` on an available server, recr
 =====
 If you recreate databases using xref:database-administration/standard-databases/recreate-database.adoc#undefined-servers[undefined servers] or xref:database-administration/standard-databases/recreate-database.adoc#undefined-servers-backup[undefined servers with fallback backup], the store might not be recreated as up-to-date as possible in certain edge cases where the `system` database has been restored.
 =====
++
+image::servers-cordoned-databases-moved.svg[width="400", title="All write-unavailable databases were recreated", role=popup]
 
 . For each `Cordoned` server, run `DEALLOCATE DATABASES FROM SERVER cordoned-server-id` on one of the available servers.
 This will move all database allocations from this server to an available server in the cluster.
 +
+image::servers-deallocated.svg[width="400", title="Deallocate databases from unavailable servers", role=popup]
++
+Note that database D was still write-available, which means its allocations can be moved from lost servers to available ones when the lost servers are deallocated.
++
 [NOTE]
 =====
 This operation might fail if enough unconstrained servers were not added to the cluster to replace lost servers.
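The deallocation step referenced in the hunk above could be issued like this sketch; `cordoned-server-id` is the same placeholder the doc uses, and the address is hypothetical:

```shell
# Sketch only: instruct the cluster to move all database allocations
# off one cordoned server onto available servers.
cypher-shell -a neo4j://healthy-server:7687 -u neo4j \
  "DEALLOCATE DATABASES FROM SERVER 'cordoned-server-id'"
```

As the surrounding note warns, this can fail if too few unconstrained replacement servers were added earlier.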
@@ -241,8 +292,11 @@ Another reason is that some available servers are also `Cordoned`.
 
 . For each deallocating or deallocated server, run `DROP SERVER deallocated-server-id`.
 This removes the server from the cluster's view.
-====
-
++
+image::fully-recovered-cluster.svg[width="400", title="The fully recovered cluster", role="popup"]
++
+After dropping the deallocated servers, you still have to ensure that all moved and recreated databases are write-available.
+For this purpose, follow the steps <<write-available-databases-steps, below>>.
 
 [[make-databases-write-available]]
 === Make databases write-available
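The final removal step in the hunk above can be sketched the same way; `deallocated-server-id` is the doc's own placeholder:

```shell
# Sketch only: remove a fully deallocated server from the cluster's view.
cypher-shell -a neo4j://healthy-server:7687 -u neo4j \
  "DROP SERVER 'deallocated-server-id'"
```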
@@ -280,14 +334,14 @@ Instead, check that the primary is allocated on an available server and that it
 A stricter verification can be done to verify that all databases are in their desired states on all servers.
 For the stricter check, run `SHOW DATABASES` and verify that `requestedStatus` = `currentStatus` for all database allocations on all servers.
 
+[[write-available-databases-steps]]
 ==== Path to correct state
+
 Use the following steps to make all databases in the cluster write-available again.
 They include recreating any databases that are not write-available and identifying any recreations that will not complete.
 Recreations might fail for different reasons, but one example is that the checksums do not match for the same transaction on different servers.
 
-.Guide
-[%collapsible]
-====
+
 . Identify all write-unavailable databases by running `CALL dbms.cluster.statusCheck([])` as described in the <<#example-verification, Example verification>> part of this disaster recovery step.
 Filter out all databases desired to be stopped, so that they are not recreated unnecessarily.
 . Recreate every database that is not write-available and has not been recreated previously.
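The status-check call in the step above could be run as in this sketch; the address is a placeholder, and an empty list argument means all databases are checked, per the doc's own usage:

```shell
# Sketch only: check write availability of every database in the cluster.
# An empty list ([]) asks the procedure to cover all databases.
cypher-shell -a neo4j://healthy-server:7687 -u neo4j \
  "CALL dbms.cluster.statusCheck([])"
```

Databases reported as write-unavailable (and not intentionally stopped) are the candidates for recreation in the following steps.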
@@ -308,4 +362,7 @@ Recreating a database will not complete if one of the following messages is disp
 ** `No store found on any of the seeders ServerId1, ServerId2...`
 . For each database which will not complete recreation, recreate them from backup using xref:database-administration/standard-databases/recreate-database.adoc#uri-seed[Backup as seed].
 
-====
+
+
+
+