Migrate docs.rs to RDS and ECS #353

jdno · 2023-10-13T09:07:30Z

syphar · 2023-10-14T09:12:35Z

another thing we need to figure out:

how would the database migrations be run in the deploy process?

syphar · 2023-10-14T09:42:24Z

after checking our NGINX config there is a second piece we need to solve somehow:

IP blocks.

Every now and then we have a mis-acting crawler and in these cases we blocked the source IP in NGINX on our server.

I would prefer to have this in AWS / CloudFront if possible.

Otherwise we would add this to our web container, probably configured via environment variable?

syphar · 2023-10-14T10:38:37Z

Next piece we need before prod :

Access to logs

jdno · 2023-10-16T11:17:43Z

For blocking IPs, we should just set up a web-application firewall (AWS WAF). I actually think that we already have one set up for docs.rs, but I'm not 100% sure.

Access to the logs is a good point! It probably makes sense to stream all logs to a central place, whether that's CloudWatch or an external tool like Datadog.

meysam81 · 2023-11-20T14:26:16Z

@jdno Please let me know if you need a hand with any of the items in this list 🙂

syphar · 2023-12-01T08:59:44Z

@jdno coming from this discussion I want to add here that the docs.rs containers / servers should not be reachable directly from the internet. So all traffic needs to go through CloudFront & AWS WAF

syphar · 2024-03-01T10:12:41Z

One thought I had thinking about this topic again:

CloudFront has a hard limit on in-progress wildcard path invalidations (15)
we are invalidating the crate docs after each build

from rust-lang/docs.rs#1871 (comment)

Looking at https://docs.rs/releases/activity it seems we average at least 600 releases per day. If an average invalidation takes 5 minutes and we can have 15 in parallel, that's 3 invalidations per minute throughput. With 1440 minutes in a day, we could handle up to 4320 builds per day before we wind up in unbounded growth land. Of course, that's based on a significant assumption about how long an invalidation takes.

I'm not sure if we can / should handle invalidations differently, but we might think about using fastly when we rework the infra?

Mark-Simulacrum · 2024-03-01T17:44:45Z

Can't we de-duplicate invalidations if we approach the limit? E.g., a * invalidation every 5 minutes would presumably never hit the limit. Not sure how that would affect cache hit rates, but I'd expect designing around not needing invalidations or being ok with fairly blanket invalidations to be a good long-term strategy.

(I think we've had this conversation elsewhere before).

syphar · 2024-03-01T18:32:07Z

Can't we de-duplicate invalidations if we approach the limit? E.g., a * invalidation every 5 minutes would presumably never hit the limit.

You mean "escalating" them, so when the queue is too long, just convert the queue into a full purge.
This is definitely would work, but would mean that the user experience (especially outside the US) is worse until the cache is fuller again. Of course this might be acceptible for us.

being ok with fairly blanket invalidations

This also means that the backend always has to be capable to handle the full uncached load, and higher egress costs depending on how often we have to do the full purge.

I also remember a discussion at EuroRust that we could think about having additional docs.rs webservers (also readonly DB & local bucket?) in some regions (europe?).

I'd expect designing around not needing invalidations

You're right, this is a valid discussion to have. I imagine this would only work when the URLs would include something like the build-number in the URL, and replace the more generic URLs rest with redirects. If I'm not missing something this would revert some of the SEO & URL work from rust-lang/docs.rs#1438 (introducing /latest/ URLs ). And then people would start linking specific docs-builds in their sites as they did before we had /latest/.

(I think we've had this conversation elsewhere before).

you're probably right :)

I wanted to bring it up here as a point, for when we migrate infra anyways.

Mark-Simulacrum · 2024-03-07T02:52:59Z

Note that (IMO) if we can get the cache keys setup right, i.e. everything except HTML is always at a by-hash file path - it seems to me that /latest/ can just be served with a short ttl (5 minutes), perhaps with stale-while-revalidate. That means that there's a small period where it's not necessarily consistent what version you get from it across all pages if some are cached locally and some aren't (and likewise for CDN), but I don't see any real problem with that. Users mostly won't even notice.

Yes, especially anything out of s3 can definitely be replicated if we need it to be into multiple regions pretty easily. This just causes issues while you need invalidations since you're racing against replication which can itself take some time (hours IIRC for the cheap option and minutes for the costly one?)

syphar · 2024-03-07T18:18:14Z

Note that (IMO) if we can get the cache keys setup right, i.e. everything except HTML is always at a by-hash file path - it seems to me that /latest/ can just be served with a short ttl (5 minutes), perhaps with stale-while-revalidate. That means that there's a small period where it's not necessarily consistent what version you get from it across all pages if some are cached locally and some aren't (and likewise for CDN), but I don't see any real problem with that. Users mostly won't even notice.

Jep, everything except HTML should have already have hashed filenames, with some small exceptions.
For HTML I (personally) would prefer longer caching duration, 5 minutes outdated is probably fine, not sure now much that would reduce the user happiness for some crates. I'll probably try to get better data on how the cache coverage for certain crates is at some point and see in more detail how the impact would be on users. And it might also be the case that it's just me that needs these kind of response times for docs :)

Yes, especially anything out of s3 can definitely be replicated if we need it to be into multiple regions pretty easily. This just causes issues while you need invalidations since you're racing against replication which can itself take some time (hours IIRC for the cheap option and minutes for the costly one?)

That's good to know, thanks!

jdno added this to infra-team Oct 24, 2022

jdno converted this from a draft issue Oct 13, 2023

syphar mentioned this issue Oct 14, 2023

web-server: add compression rust-lang/docs.rs#2271

Open

jdno moved this from Backlog to In Progress in infra-team Oct 16, 2023

jdno self-assigned this Oct 16, 2023

jdno removed their assignment Dec 2, 2024

jdno moved this from In Progress to Ready in infra-team Dec 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Migrate docs.rs to RDS and ECS #353

Migrate docs.rs to RDS and ECS #353

jdno commented Oct 13, 2023 •

edited

Loading

syphar commented Oct 14, 2023

syphar commented Oct 14, 2023

syphar commented Oct 14, 2023

jdno commented Oct 16, 2023

meysam81 commented Nov 20, 2023

syphar commented Dec 1, 2023

syphar commented Mar 1, 2024

Mark-Simulacrum commented Mar 1, 2024

syphar commented Mar 1, 2024

Mark-Simulacrum commented Mar 7, 2024

syphar commented Mar 7, 2024

Migrate docs.rs to RDS and ECS #353

Migrate docs.rs to RDS and ECS #353

Comments

jdno commented Oct 13, 2023 • edited Loading

Questions

syphar commented Oct 14, 2023

syphar commented Oct 14, 2023

syphar commented Oct 14, 2023

jdno commented Oct 16, 2023

meysam81 commented Nov 20, 2023

syphar commented Dec 1, 2023

syphar commented Mar 1, 2024

Mark-Simulacrum commented Mar 1, 2024

syphar commented Mar 1, 2024

Mark-Simulacrum commented Mar 7, 2024

syphar commented Mar 7, 2024

jdno commented Oct 13, 2023 •

edited

Loading