Commit e6e5836

smooth out formatting, add screenshot

1 parent 4adfde0 commit e6e5836
File tree

2 files changed: +47 −29 lines changed

@@ -1,39 +1,40 @@
 ---
-title: "Troubleshooting upgrades"
+title: "Troubleshooting 4.x upgrades"
 linkTitle: "Troubleshooting upgrades"
 weight: 50
 aliases:
 -
 description: >
-  What to do when CHT 4.x upgrades get stuck
+  What to do when CHT 4.x upgrades don't work as planned
 relatedContent: >
   hosting/4.x/data-migration
 ---
 
-With 4.x well into a mature stage as 4.0.0 was released in November of 2022, Medic has learned a number of important lessons on how to unstick 4.x upgrades that get stuck. Below are some specific tips as well as general practices on upgrading 4.x.
+4.0.0 was released in November of 2022, so 4.x is now well into a mature stage, and Medic has learned a number of important lessons on how to unstick 4.x upgrades. Below are some specific tips as well as general practices for upgrading 4.x.
 
 {{% pageinfo %}}
 All tips apply to both [Docker]({{< relref "hosting/4.x/production/docker" >}}) and [Kubernetes]({{< relref "hosting/4.x/production/kubernetes" >}}) based deployments unless otherwise specified.
 
 All upgrades are expected to succeed without issue. Do not attempt any fixes unless you actively have a problem upgrading.
 {{% /pageinfo %}}
 
-## Before you start
+## Considerations
 
-tk - flesh out, but be prepared by:
+When troubleshooting, first make sure that:
 
-* Have and have tested backups
-* Have extra disk space (up to 5x!)
-* Have tested the upgrade on a dev instance
-* ?
+* Backups exist and restores have been tested
+* Extra disk space is available (up to 5x!)
+* The upgrade has been tested on a development instance with production data
 
 ## A go-to fix: restart
 
-A safe fix for any upgrade getting stuck is to restart all services. Any views that were being re-indexed will be picked up where they left off without loosing any work. This should be your first step when trouble shooting a stuck upgrade.
+A safe fix for any upgrade getting stuck is to restart all services. Any views that were being re-indexed will be picked up where they left off without losing any work. This should be your first step when troubleshooting a stuck upgrade.
+
+If you're able to, go back into the admin web GUI after a restart and try the upgrade again. Consider trying this at least twice.
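The restart step described above can be scripted for a Docker-based deployment. A minimal sketch, assuming default container names like `cht_api_1` (these are assumptions, so confirm yours with `docker ps`); it prints one restart command per service so you can review them before piping the output to `sh`:

```shell
# Print a restart command for each CHT service; pipe to `sh` to execute.
# Container names like "cht_api_1" are assumptions - confirm with `docker ps`.
for svc in haproxy api sentinel couchdb nginx healthcheck; do
  echo "docker restart cht_${svc}_1"
done
```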

 ## CHT 4.0.x - 4.3.x: CouchDB Crashes
 
-**[issue](https://github.com/medic/cht-core/issues/9286)**: Starting an upgrade that involves view indexing can cause CouchDB to crash on large databases (>30m docs)
+**[Issue #9286](https://github.com/medic/cht-core/issues/9286)**: Starting an upgrade that involves view indexing can cause CouchDB to crash on large databases (>30m docs). The upgrade will fail and you will see the logs below when you have this issue.
 
 HAProxy:

@@ -52,16 +53,16 @@ CouchDB
 ```
 
 **Fix:**
-1. I'm checking that all the indexes are warmed by loading them one by one in fauxton.
-2. Restart all services, **retry** upgrade from Admin GUI (not cancel and upgrade)
+1. Check that all the indexes are warmed by loading them one by one in Fauxton.
+2. Restart all services, then **retry** the upgrade from the Admin GUI - do not cancel and upgrade.
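The Fauxton warming step above can also be done from the command line. A minimal sketch, assuming a local CouchDB with admin credentials and two example view names (`COUCH_URL` and the view names are assumptions - list the real design docs on your instance via `/medic/_design_docs`); it prints one warm-up URL per view, which you can pipe to `xargs -n1 curl -s -o /dev/null`:

```shell
# Querying a stale view with limit=1 makes CouchDB finish building its index.
# COUCH_URL and the view names below are assumptions - adjust for your instance.
COUCH_URL="http://admin:password@localhost:5984"
for view in contacts_by_type reports_by_form; do
  echo "${COUCH_URL}/medic/_design/medic/_view/${view}?limit=1"
done
```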

-## CHT 4.2.4 - 4.c.x: view indexing can become stuck after indexing is finished
+## CHT 4.0.0 - 4.2.2: view indexing can become stuck after indexing is finished
 
-**[issue](https://github.com/medic/cht-core/issues/9617):** Starting an upgrade that involves view indexing can become stuck after indexing is finished
+**[Issue #9617](https://github.com/medic/cht-core/issues/9617):** An upgrade that involves view indexing can become stuck after indexing is finished.
 
-upgrade process stalls after view indexes are built
+The upgrade process stalls while trying to index staged views:
 
-tk - get screenshot of admin UI with no progress bar
+![CHT Core admin UI showing upgrade progress bar stalled at 4%](stalled-upgrade.png)
 
 **Fix:**
 
@@ -71,14 +72,14 @@ Unfortunately, the workaround is manual and very technical and involves:
 * The admin upgrade page will say that the upgrade was interrupted, click retry upgrade.
 * Depending on the state of the database, you might see view indexing again. Depending on how many docs need to be indexed, indexing might get stuck again. Go back to 1 if that happens.
 * Eventually, when indexing jobs are short enough not to trigger a request hang, you will get the button to complete the upgrade.
-* 
+
 ## CHT 4.0.1 - 4.9.0: CouchDB restart causes all services to go down
 
 **Note** - This is a Docker only issue.
 
-**[issue](https://github.com/medic/cht-core/issues/9284)**: A couchdb restart in single node docker takes down the whole instance.
+**[Issue #9284](https://github.com/medic/cht-core/issues/9284)**: A CouchDB restart in single-node Docker takes down the whole instance. The upgrade will fail and you will see the logs below when you have this issue.
 
-Haproxy continuously reports NOSRV errors like:
+HAProxy reports `NOSRV` errors:
 
 ```shell
 <150>Jul 25 18:11:03 haproxy[12]: 172.18.0.9,<NOSRV>,503,0,1001,0,GET,/,-,admin,'-',241,-1,-,'-'
@@ -96,26 +97,42 @@ nginx reports:
 2024/07/25 18:40:28 [error] 43#43: *5757 connect() failed (111: Connection refused) while connecting to upstream, client: 172.18.0.1,
 ```
 
-
 **Fix:** Restart all services
 
 
 ## CHT 4.x.x upgrade to 4.x.x - no more free disk space
 
-[Issue](https://github.com/moh-kenya/config-echis-2.0/issues/2578#issuecomment-2455702112): prod instance couch is crashing, stuck at compaction initiation - escalated to MoH Team to resolve [lack of free disk space issue]
-tk - can't (re)start services during upgrade
+**Issue\*:** CouchDB is crashing during the upgrade. The upgrade will fail and you will see the logs below when you have this issue. While there are two log scenarios, both have the same fix.
+
+CouchDB logs scenario 1:
+
+```shell
+[error] 2024-11-04T20:42:37.275307Z [email protected] <0.29099.2438> -------- rexi_server: from: [email protected](<0.3643.2436>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,682}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,140}]}]
+[error] 2024-11-04T20:42:37.275303Z [email protected] <0.10933.2445> -------- rexi_server: from: [email protected](<0.3643.2436>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,map_fold,3,[{file,"src/couch_mrview.erl"},{line,526}]},{couch_bt_engine,include_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1074}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1069}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,848}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,819}]}]
+[error] 2024-11-04T20:42:37.275377Z [email protected] <0.7374.2434> -------- rexi_server: from: [email protected](<0.3643.2436>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,265}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,205}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,462}]},{couch_mrview,map_fold,3,[{file,"src/couch_mrview.erl"},{line,526}]},{couch_bt_engine,include_reductions,4,[{file,"src/couch_bt_engine.erl"},{line,1074}]},{couch_bt_engine,skip_deleted,4,[{file,"src/couch_bt_engine.erl"},{line,1069}]},{couch_btree,stream_kv_node2,8,[{file,"src/couch_btree.erl"},{line,848}]},{couch_btree,stream_kp_node,8,[{file,"src/couch_btree.erl"},{line,819}]}]
+```
+
+CouchDB logs scenario 2:
+
+```shell
+[info] 2024-11-04T20:18:46.692239Z [email protected] <0.6832.4663> -------- Starting compaction for db "shards/7ffffffe-95555552/medic-user-mikehaya-meta.1690191139" at 10
+[info] 2024-11-04T20:19:47.821999Z [email protected] <0.7017.4653> -------- Starting compaction for db "shards/7ffffffe-95555552/medic-user-marnyakoa-meta.1690202463" at 21
+[info] 2024-11-04T20:21:24.529822Z [email protected] <0.24125.4661> -------- Starting compaction for db "shards/7ffffffe-95555552/medic-user-lilian_lubanga-meta.1690115504" at 15
+```
 
 **Fix:** Give CouchDB more disk and Restart all services
 
+_* See eCHIS Kenya [Issue #2578](https://github.com/moh-kenya/config-echis-2.0/issues/2578#issuecomment-2455702112) - a private repo not available to the public_
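Before retrying, confirm the host actually has the headroom called for in the considerations above (up to 5x the current database size). A minimal check; `/srv/storage` is an assumed CouchDB data mount point, so substitute the volume path from your own deployment:

```shell
# Show free space on the filesystem backing CouchDB's data directory.
# /srv/storage is an assumption - falls back to the root filesystem if absent.
df -h /srv/storage 2>/dev/null || df -h /
```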
 
-## CHT 4.2.x upgrade to 4.11 - kubernetes has pods stuck in indeterminate state
+
+## CHT 4.2.x upgrade to 4.11 - Kubernetes has pods stuck in indeterminate state
 
 **Note** - This is a Kubernetes only issue.
 
-[Issue](https://github.com/moh-kenya/config-echis-2.0/issues/2579#issuecomment-2455637516): A number of pods were stuck in indeterminate state, presumably because of failed garbage collection
-API Logs
+**Issue\*:** A number of pods were stuck in an indeterminate state, presumably because of failed garbage collection.
+
+API Logs:
 
 ```shell
 2024-11-04 19:33:56 ERROR: Server error: StatusCodeError: 500 - {"message":"Error: Can't upgrade right now.
 The following pods are not ready...."}
@@ -127,7 +144,8 @@ Running `kubectl get po` shows 3 pods with status of `ContainerStatusUnknown`:
 
 **Fix:** delete pods so they get recreated and start cleanly
 
-(tk - is this syntax legal/correct?)
-
-`kubectl delete po 'cht.service in (api, sentinel, haproxy, couchdb)'`
+```shell
+kubectl delete po -l 'cht.service in (api, sentinel, haproxy, couchdb)'
+```
 
+_* See eCHIS Kenya [Issue #2579](https://github.com/moh-kenya/config-echis-2.0/issues/2579#issuecomment-2455637516) - a private repo not available to the public_