Commit 1940384 (parent 0f3b1fe)

Add blog incident report for week starting 2025-01-20

---
date: 2025-01-24
title: "Global Outage: Elevated 500 error levels"
linkTitle: "Global Outage: Elevated 500 error levels"
description: "Hachyderm experienced elevated 500 error levels in mid-January 2025. Here is the postmortem, analysis, and writeup on the incident."
author: Emelia Smith ([@thisismissem](https://hachyderm.io/@thisismissem))
resources:
  - src: "**.{png,jpg}"
    title: "Image #:counter"
    params:
      byline: "Photo: Kris Nóva"
---

On Monday, January 20th, 2025 at roughly 17:00 UTC, Hachyderm started to experience elevated 500 error levels during routine system maintenance.

On Tuesday, January 21st, 2025 at roughly 17:20 UTC, Hachyderm again started to experience elevated 500 error levels. Hachyderm appeared to still be up, but there were reports of outages.

Impacted users experienced 500 internal server error responses from `https://hachyderm.io` in all regions of the world.

## Context

As part of routine maintenance, we went through each of our machines and ran regular system updates.

Two of the updated packages included `libicu` and `libvips`. The affected machines were the `mastodon-web` machines, known as `fritz` and `buttons`, and the `mastodon-sidekiq` machines, known as `franz` and `freud`. `nietzsche` is our database server, which we are in the process of replacing.

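
The reason the `libicu` update mattered is that several of Mastodon's Ruby gems ship native extensions compiled against a specific libicu version, so a library upgrade can leave those extensions unable to load until the bundle is rebuilt (which is effectively what reinstalling ruby did later in the timeline). As a rough, illustrative sketch only (not the exact commands we ran; the vendor path is taken from the journal output further below), affected extensions could be spotted like this:

```
find /var/lib/mastodon-versions/v4.3.3/vendor/bundle -name '*.so' |
  while read -r ext; do
    # ldd reports "libicu*.so.NN => not found" when an extension still links
    # against a libicu soname that the system update removed.
    ldd "$ext" | grep -q 'libicu.*not found' && echo "needs rebuild: $ext"
  done
```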
## Timeline

All events are documented in [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) time.

### 2025-01-20
- **15:38** `@esk` Started doing routine system updates
- **16:35** `@esk` Ran system updates on `buttons`
- **16:43** `@esk` Rebooted `buttons`, and questioned whether we'd have issues due to the `libicu` update, since it is compiled into Ruby dependencies for Mastodon
- **16:44** `@esk` Noted that `buttons` seemed okay
- **16:45** `@esk` Started system updates on `fritz`
- **16:57** `@esk` Rebooted `fritz`
- **16:59** `@esk` Noted that whilst `fritz` was down for maintenance, requests did not correctly fail over to `buttons`
- **17:00** `@ThisIsMissEm` Responded saying that failover should have happened based on our CDN configuration, but for some reason the CDN didn't see `fritz` as being down
- **17:18** `@esk` Noted that something didn't look good whilst attempting to run system updates on `franz`. Logs were showing a database connection error:
```
[novix@franz]: ~$ sudo journalctl -xeu mastodon-sidekiq-ingress@1
Jan 20 17:18:41 franz bundle[600]: 2025-01-20T17:18:41.618Z pid=600 tid=2pgw class=ActivityPub::ProcessingWorker jid=b29a74a7be10bcbcdba588be elapsed=30.137 INFO: fail
Jan 20 17:18:41 franz bundle[600]: 2025-01-20T17:18:41.618Z pid=600 tid=2pgw WARN: {"context":"Job raised exception","job":{"retry":8,"queue":"ingress","backtrace":true,"args":[1125751588830572xx,"{\"@context\":[\"https://www.w3
Jan 20 17:18:41 franz bundle[600]: 2025-01-20T17:18:41.618Z pid=600 tid=2pgw WARN: ActiveRecord::DatabaseConnectionError: There is an issue connecting with your hostname: nietzsche.
Jan 20 17:18:41 franz bundle[600]: Please check your database configuration and ensure there is a valid connection to your database.
Jan 20 17:18:41 franz bundle[600]: 2025-01-20T17:18:41.618Z pid=600 tid=2pgw WARN: /var/lib/mastodon-versions/v4.3.3/vendor/bundle/ruby/3.3.0/gems/activerecord-7.1.5.1/lib/active_record/connection_adapters/postgresql_adapter.rb:7
Jan 20 17:18:41 franz bundle[600]: /var/lib/mastodon-versions/v4.3.3/vendor/bundle/ruby/3.3.0/gems/activerecord-7.1.5.1/lib/active_record/connection_adapters/postgresql_adapter.rb:68:in `new_client'
Jan 20 17:18:41 franz bundle[600]: /var/lib/mastodon-versions/v4.3.3/vendor/bundle/ruby/3.3.0/gems/activerecord-7.1.5.1/lib/active_record/connection_adapters/postgresql_adapter.rb:980:in `connect'
Jan 20 17:18:41 franz bundle[600]: 2025-01-20T17:18:41.919Z pid=600 tid=650 INFO: Shutting down
Jan 20 17:18:41 franz systemd[1]: Stopping mastodon-sidekiq...
```

The specific part to note is:
```
WARN: ActiveRecord::DatabaseConnectionError: There is an issue connecting with your hostname: nietzsche.
Jan 20 17:18:41 franz bundle[600]: Please check your database configuration and ensure there is a valid connection to your database.
```
- **17:19** An incident call was started between `@esk` and `@ThisIsMissEm`
- **17:20** `@esk` Tested the connectivity to `nietzsche`, our database server, and connectivity was fine.
- **17:20** `@esk` Noted a suspicion it might be because of the ruby/icu update.
- **17:20** `@esk` Performed a reinstall of ruby on the affected machine (`franz`)
- **17:25** After the reinstall, `franz` appeared to start up sidekiq fine.
- **17:30** Some investigation happened into a warning that was being generated:
```
(process:24889): VIPS-WARNING **: 17:30:23.963: unable to load "/usr/lib/vips-modules-8.16/vips-poppler.so" -- libpoppler-glib.so.8: cannot open shared object file: No such file or directory
(process:24889): VIPS-WARNING **: 17:30:23.963: unable to load "/usr/lib/vips-modules-8.16/vips-openslide.so" -- libopenslide.so.1: cannot open shared object file: No such file or directory
```
- **17:35** `@esk` Installed the missing packages for `libpoppler` and `libopenslide`, resolving the warnings.
- **17:43** `@ThisIsMissEm` responded that these warnings had actually always been present: we don't use poppler or openslide with libvips, so we had never installed them.
- **17:59** Discussion between `@ThisIsMissEm` and `@esk` about the failover not happening correctly.
  - Something was making the CDN nodes not detect that the `mastodon-web` nodes were down
  - This was identified as being caused by caching configuration in the nginx on `fritz` and `buttons`, left over from when `fritz` was still exposed directly to the internet (we'd moved everything behind the CDN back in November/December as part of our [Global Resiliency Plan]({{< ref "../posts/2024-11-infrastructure-updates" >}}))
  - This caching on `fritz` and `buttons` meant that even though they were down, some percentage of requests were still being served from the cache as 200 responses, so the CDN nodes didn't detect that they were down (see the sketch after this day's timeline)
- **18:00** Initial incident call ended
- **18:30** First confirmation, and first report, of 500 responses globally
- **18:40** `@esk` Decided to reinstall ruby on all mastodon-web machines.
- **19:00** `@esk` Noted that `buttons` wasn't starting up correctly after reinstalling ruby, reporting the same database connection error as `franz` earlier.
- **19:00** New incident response call started
- **19:40** After reinstalling ruby on `buttons` and `fritz`, the logs looked fine, and the alert stopped firing.
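
To illustrate the caching problem described above: a hedged sketch of the kind of nginx configuration that can mask an origin outage (the directives and values here are illustrative assumptions, not our exact configuration). With a local proxy cache and `proxy_cache_use_stale` covering error responses, a cacheable URL keeps returning 200 from `fritz` or `buttons` even while the Mastodon backend behind them is down, so a CDN health check that only looks for a 200 never sees a failure:

```
# Illustrative only: not Hachyderm's actual nginx configuration.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=mastodon:10m inactive=7d;

server {
    listen 443 ssl;
    server_name hachyderm.io;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_cache mastodon;
        proxy_cache_valid 200 5m;
        # Serving stale cached responses on errors/timeouts is exactly what
        # hides a dead backend from a "did I get a 200?" health check.
        proxy_cache_use_stale error timeout http_500 http_502 http_503;
    }
}
```

One direction for addressing this, reflected in the opportunities below, is to disable caching on the origin nodes now that the CDN layer handles it, or at least to exclude the CDN's health-check path from the cache.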

### 2025-01-21
- **15:15** Alerts started firing for elevated 500 error levels; `@ThisIsMissEm` was the only person around at the time, but hadn't been receiving notifications for the alerts.
- **17:20** A community member reported seeing 500 errors when attempting to use Hachyderm, which `@ThisIsMissEm` saw.
- **17:21** `@ThisIsMissEm` confirmed the outage via grafana:
![Graph showing 5xx errors](./image-01.png)
<br /><br />
- **17:23** `@ThisIsMissEm` noticed that there were database connection errors on `buttons` too:
```
Jan 21 15:51:55 buttons bundle[807963]: E, [2025-01-21T15:51:55.658538 #807963] ERROR -- : [65517bcd-64a2-4cd8-922e-440466a8e499]
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] ActiveRecord::DatabaseConnectionError (There is an issue connecting with your hostname: nietzsche.
Jan 21 15:51:55 buttons bundle[807963]: Please check your database configuration and ensure there is a valid connection to your database.
Jan 21 15:51:55 buttons bundle[807963]: ):
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499]
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/models/domain_block.rb:73:in `rule_for'
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/models/domain_block.rb:49:in `suspend?'
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/helpers/domain_control_helper.rb:16:in `domain_not_allowed?'
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/controllers/concerns/signature_verification.rb:223:in `actor_from_key_id'
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/controllers/concerns/signature_verification.rb:56:in `signed_request_actor'
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] app/controllers/concerns/signature_verification.rb:20:in `require_actor_signature!'
Jan 21 15:51:55 buttons bundle[807963]: [65517bcd-64a2-4cd8-922e-440466a8e499] lib/mastodon/rack_middleware.rb:9:in `call'
Jan 21 15:51:55 buttons bundle[807848]: I, [2025-01-21T15:51:55.675758 #807848] INFO -- : [a4191d7b-c2c6-4233-9ef0-e97b43886729] method=GET path=/api/v2/notifications/policy format=*/*
```
- **17:23** `@ThisIsMissEm` started investigating whether we actually did have a database outage, since it couldn't be related to the updates performed the day before, as we'd "fixed" that issue.
- **17:25** `@ThisIsMissEm` reached out to other Mastodon admins to see if they had encountered this error, since the database appeared to be reachable. That led to the suggestion to check the error logs on the database servers just in case.
- **17:45** `@ThisIsMissEm` discovered that `pgbouncer` on `nietzsche` was actually logging errors, but no alerts were firing.
```
Jan 21 15:51:56 host pgbouncer[8342]: 2025-01-21 15:51:56.202 UTC [8342] ERROR accept() failed: Too many open files
```
This was the cause of the error from the previous day:
```
WARN: ActiveRecord::DatabaseConnectionError: There is an issue connecting with your hostname: nietzsche.
Jan 20 17:18:41 franz bundle[600]: Please check your database configuration and ensure there is a valid connection to your database.
```
The database was indeed effectively down and unreachable, because mastodon-web and mastodon-sidekiq couldn't acquire a connection via pgbouncer.
- **17:45-18:45** The team discussed the appropriate solution to increase the file handles available to pgbouncer; this wasn't something `@ThisIsMissEm` knew how to resolve on her own, and the documentation for how to increase these limits in systemd was confusing (see the sketch after this day's timeline).

In the systemd unit file, you need to set:
```
LimitNOFILE=3000:524288
```
which then gets printed as:
```
LimitNOFILE=524288
LimitNOFILESoft=3000
```
- **17:50** Outage confirmed to be resolved
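
For reference, a sketch of how those limits can be applied and verified with a systemd drop-in override, assuming pgbouncer runs as `pgbouncer.service` (the commands illustrate the approach rather than record the exact change we made):

```
$ sudo systemctl edit pgbouncer.service
# In the drop-in that opens, set soft:hard limits under [Service]:
#   [Service]
#   LimitNOFILE=3000:524288
$ sudo systemctl restart pgbouncer.service

# Verify what systemd actually applied:
$ systemctl show pgbouncer.service -p LimitNOFILE -p LimitNOFILESoft
LimitNOFILE=524288
LimitNOFILESoft=3000

# Check current descriptor usage (assumes a single pgbouncer process):
$ sudo ls /proc/"$(pidof pgbouncer)"/fd | wc -l
```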

## Root Cause

An error that first appeared during system updates was misinterpreted as being caused by the system updates, rather than by a separate system actually being in a failure mode. The database connection pooler, pgbouncer, had been configured for 2000 client connections, but only 1024 max file handles. Hachyderm's recent infrastructure changes had raised the maximum potential database connections to 1428.

The 500 errors were actually database errors, and not related to the system updates. It just so happened that while we were performing the upgrades, we also experienced elevated traffic levels, which pushed us over the 1024 connections that pgbouncer could support.
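
The 1428 figure is the sum of the connections each Mastodon service can open through pgbouncer. A hedged sketch of that arithmetic, following the approach in the Scaling Mastodon post linked at the end of this report (the per-service numbers below are placeholders, not Hachyderm's real settings):

```
# Placeholder values for illustration only; substitute the real
# WEB_CONCURRENCY, MAX_THREADS / DB_POOL, and sidekiq thread counts per host.
web_processes=8;       web_db_pool=15      # puma workers x DB_POOL per worker
sidekiq_processes=10;  sidekiq_db_pool=25  # one connection per sidekiq thread
streaming_processes=4; streaming_db_pool=10

total=$(( web_processes * web_db_pool + sidekiq_processes * sidekiq_db_pool + streaming_processes * streaming_db_pool ))
echo "maximum potential database connections: $total"
# pgbouncer's max_client_conn (and its file-descriptor limit) must stay
# comfortably above this total, with some headroom.
```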

![Graph of HTTP Requests to Hachyderm](./image-02.png)

### Things that went well

TODO

### Things that went poorly

Multiple different issues were happening simultaneously:
- Increased HTTP requests resulting in database connection exhaustion
- System updates to packages known to have caused issues in the past
- CDN failover not happening correctly

There was a lack of alerting on pgbouncer errors, which resulted in the database issue not being discovered in a timely fashion.

We didn't check the grafana dashboards for 500 errors, which would have indicated that the outage actually started slightly before the updates were performed.

### Opportunities

We need to improve our monitoring and alerting on critical systems.

We need to improve our nginx configuration to ensure failover works correctly.

We need to review our pgbouncer and postgresql configuration.
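
For the pgbouncer review, the key is keeping `max_client_conn`, the pool sizes, and the process's file-descriptor limit consistent with one another. A hedged `pgbouncer.ini` sketch (the values are illustrative, not our production settings):

```
; Illustrative pgbouncer.ini excerpt; example values, not Hachyderm's.
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction

; Every client connection consumes a file descriptor, in addition to the
; server-side connections, so the service's LimitNOFILE needs to exceed
; roughly max_client_conn + (default_pool_size * databases * users).
max_client_conn = 2000
default_pool_size = 100
```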

## Resulting Action

#### 1) Alerts added for pgbouncer errors by `@ThisIsMissEm`

Alerts for error log lines on pgbouncer were added.
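
As an illustration of the shape of that alert, assuming pgbouncer's journal output is shipped to Grafana Loki (the label names and threshold here are assumptions rather than the exact rule that was deployed):

```
# Hypothetical Loki alert query: fire when pgbouncer logs any ERROR lines,
# such as "accept() failed: Too many open files", within the last 5 minutes.
count_over_time({host="nietzsche", unit="pgbouncer.service"} |= "ERROR" [5m]) > 0
```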

#### 2) Spreadsheet created to correctly calculate the database connections required, by `@ThisIsMissEm`

This will be published after review from the Hachyderm team. We'd previously run the calculations set out in `@hazelweakly`'s fantastic [Scaling Mastodon](https://hazelweakly.me/blog/scaling-mastodon/#db_pool-notes-from-nora's-blog) blog post by hand.
