Count Downloads Using CDN Logs #372

Closed
8 tasks done
jdno opened this issue Dec 11, 2023 · 2 comments

@jdno
Member

jdno commented Dec 11, 2023

Problem

crates.io counts downloads by crate and version. This is currently done as part of the /download endpoint, which counts the download and then redirects the caller to the Content Delivery Networks (CDNs) for static.crates.io, from where the actual file is downloaded.

sequenceDiagram
	User->>crates.io: Requests crate
	crates.io->>crates.io: Counts crate and version
	crates.io->>User: Redirects user to static.crates.io
	User->>static.crates.io: Requests crate
	static.crates.io->>User: Serves crate file

Due to the volume of requests to the /download endpoint, counting the crate and its version in the application has a significant performance cost. Especially when traffic spikes, the application can struggle to keep up with requests, which in the worst case can cause a service outage.

Goal

Key Objectives

  1. Avoid hitting the web app for every crate download
  2. Continue to count downloads by crate and version

Desired Outcome

In the ideal scenario, download requests bypass the web app altogether and go straight to the CDNs. We can achieve this by changing the dl field in the index's config.json to point to the CDN instead of the application. Full compatibility with existing behavior requires rewriting some URLs on the CDN, which has already been implemented.
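
To illustrate why changing the dl field is enough to redirect traffic, here is a minimal sketch of how Cargo derives a download URL from that field, following the rule in Cargo's registry documentation: if the value contains none of the known markers, `/{crate}/{version}/download` is appended. The function name and the two example URLs are assumptions for illustration and are not taken from this issue.

```python
# Markers that Cargo substitutes in the registry's `dl` template.
MARKERS = ("{crate}", "{version}", "{prefix}", "{lowerprefix}", "{sha256-checksum}")


def download_url(dl: str, name: str, version: str) -> str:
    """Expand a registry `dl` value into the URL requested for one crate version.

    Hypothetical helper: only {crate} and {version} are substituted here.
    """
    if not any(marker in dl for marker in MARKERS):
        # No markers present: the default suffix is appended.
        dl = dl + "/{crate}/{version}/download"
    return dl.replace("{crate}", name).replace("{version}", version)


# Assumed values: one `dl` pointing at the application, one at the CDN.
print(download_url("https://crates.io/api/v1/crates", "serde", "1.0.0"))
print(download_url("https://static.crates.io/crates", "serde", "1.0.0"))
```

With a marker-free dl value, the CDN receives paths ending in `/download` rather than the path of the stored file, which is presumably the kind of URL the rewrite mentioned above has to handle.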

sequenceDiagram
	User->>static.crates.io: Requests crate
	static.crates.io->>User: Serves crate file

The CDNs could attempt to count downloads themselves, but this is difficult because the CDNs are globally distributed. There is no single point that receives all the traffic, so download counts would need to be processed and merged somewhere else. That system would quickly face the same performance issues that crates.io currently faces.

We can use the request logs from the CDNs to count downloads in an asynchronous way. The CDNs produce a single log line per request. These logs are collected and uploaded periodically to a dedicated S3 bucket as a compressed archive.

Whenever a new archive is uploaded to the bucket, S3 can push an event into an SQS queue. crates.io can monitor the queue and pull incoming events. From each event, it can determine which files to fetch from S3, download and parse them, and update the download counts in the database.
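
As a rough sketch of the "determine which files to fetch" step, the snippet below extracts bucket and key pairs from the body of an SQS message, assuming S3 delivers its standard event notification JSON directly to the queue. The function name, bucket name, and object key are made up for illustration; crates.io's actual implementation may look quite different.

```python
import json
from urllib.parse import unquote_plus


def log_files_from_event(body: str) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs for every created object named in an S3 event.

    Messages without records (such as the test event S3 sends when
    notifications are first enabled) yield an empty list.
    """
    event = json.loads(body)
    files = []
    for record in event.get("Records", []):
        if not record.get("eventName", "").startswith("ObjectCreated:"):
            continue  # ignore deletions and other event types
        s3 = record["s3"]
        # Object keys are URL-encoded in S3 event notifications.
        files.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return files


# Hypothetical message body; the bucket name and key are invented.
example_body = json.dumps({
    "Records": [{
        "eventSource": "aws:s3",
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "example-cdn-logs"},
            "object": {"key": "fastly/2024-01-31T12%3A00%3A00.log.gz"},
        },
    }]
})
print(log_files_from_event(example_body))
```

Decoding the key before calling S3 matters because a key containing characters such as `:` or spaces would otherwise not match the stored object.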

sequenceDiagram
	static.crates.io ->> S3: Uploads logs
	S3 ->> SQS: Queues event
	crates.io ->> SQS: Pulls event from queue
	crates.io ->> S3: Fetches new log file
	crates.io ->> crates.io: Parses log file
	crates.io ->> crates.io: Updates download counts

Benefits

  • Logs are processed asynchronously and in batches. This reduces the load on the server, especially during traffic spikes.
  • Publishing events into SQS is natively supported on AWS and does not require any additional infrastructure (besides an SQS queue).
  • crates.io is already integrated with S3 to manage crates. Its access can easily be extended to grant access to the SQS queue as well as the logs bucket.
  • Monitoring the queue and pulling from SQS can be implemented within the existing crates.io codebase. Alternative solutions would have required additional infrastructure and configuration, fragmenting the codebase and making long-term maintenance more difficult.

Notes

  • Logs from CloudFront and Fastly use different formats.
  • Compressed archives are typically between 5 and 20 MB in size.
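
The parsing step could be sketched as follows. Because the real CloudFront and Fastly formats differ and are not described here, the example assumes a simplified, hypothetical log format (`<status> <path>` per line), gzip compression, and a CDN path layout of `/crates/<name>/<name>-<version>.crate`. All three are assumptions, as is the function name.

```python
import gzip
import re
from collections import Counter

# Assumed path layout on the CDN: /crates/<name>/<name>-<version>.crate
CRATE_PATH = re.compile(
    r"^/crates/(?P<name>[^/]+)/(?P=name)-(?P<version>[^/]+)\.crate$"
)


def count_downloads(archive: bytes) -> Counter:
    """Count successful downloads per (crate, version) in one log archive.

    Assumes a gzip-compressed archive in a simplified, hypothetical format
    with one request per line: "<status> <path>". Real CDN logs would need
    a format-specific parser in front of this step.
    """
    counts = Counter()
    for line in gzip.decompress(archive).decode("utf-8").splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not fit the assumed format
        status, path = parts
        match = CRATE_PATH.match(path)
        if status == "200" and match:
            counts[(match["name"], match["version"])] += 1
    return counts
```

Aggregating a whole archive into one counter before touching the database is what makes the batch approach cheap: each crate version then needs at most one write per archive, regardless of how many requests it received.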

Tasks

Infra-Team

  • Create a new AWS account for crates.io
  • Deploy the new infrastructure on staging
    • Create a new SQS queue
    • Grant the crates.io application access to the SQS queue
    • Grant crates.io team access to new account
    • Grant the crates.io application access to the S3 bucket with the logs
    • Enable publishing an event from S3 when a new archive is uploaded
  • Deploy the new infrastructure to production

crates.io

(Tracked by the crates.io team)

  • Create a job that monitors the SQS queue
  • Fetch and parse new log files
  • Update the counts in the database
  • Change dl field to point to the CDN
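
For the "update the counts in the database" task, one possible shape is a single upsert per batch. The sketch below uses an in-memory SQLite database purely so that it is self-contained; the table name, column names, and function name are hypothetical and do not reflect the crates.io schema.

```python
import sqlite3
from collections import Counter


def apply_counts(conn: sqlite3.Connection, counts: Counter) -> None:
    """Add a batch of per-(crate, version) counts to the stored totals."""
    with conn:  # one transaction per batch
        conn.executemany(
            """
            INSERT INTO version_downloads (crate, version, downloads)
            VALUES (?, ?, ?)
            ON CONFLICT (crate, version)
            DO UPDATE SET downloads = downloads + excluded.downloads
            """,
            [(crate, version, n) for (crate, version), n in counts.items()],
        )


# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE version_downloads ("
    "crate TEXT, version TEXT, downloads INTEGER NOT NULL, "
    "PRIMARY KEY (crate, version))"
)
apply_counts(conn, Counter({("serde", "1.0.0"): 2}))
apply_counts(conn, Counter({("serde", "1.0.0"): 3, ("rand", "0.8.5"): 1}))
```

Note that this sketch adds counts blindly, so processing the same archive twice would count its downloads twice. A real job would likely need some way to recognize archives it has already handled.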

Resources

@jdno jdno added this to infra-team Dec 11, 2023
@github-project-automation github-project-automation bot moved this to Backlog in infra-team Dec 11, 2023
@jdno jdno moved this from Backlog to Ready in infra-team Dec 11, 2023
@jdno jdno self-assigned this Dec 19, 2023
@jdno jdno moved this from Ready to In Progress in infra-team Dec 19, 2023
jdno added a commit to jdno/rust-simpleinfra that referenced this issue Dec 19, 2023
We are planning[^1] to count crate downloads using CDN logs. This
requires new infrastructure, namely a SQS queue into which S3 can
publish events and that crates.io can monitor.

[^1]: rust-lang#372
jdno added a commit to jdno/rust-simpleinfra that referenced this issue Jan 17, 2024
We are working on using the logs from our CDNs to count crate downloads
on crates.io. Whenever a log archive is uploaded to the bucket, a
notification is sent to an SQS queue. crates.io then downloads the log,
parses it, and updates the download counts.

For this to work, crates.io needs access to the S3 bucket with the logs.
This change grants read-only access to individual log archives.

See rust-lang#372 for details.
jdno added a commit to jdno/rust-simpleinfra that referenced this issue Jan 18, 2024
The crates-io-prod account was recently created as part of the project
to count crate downloads using CDN logs (see rust-lang#372). Similar to all our
other AWS accounts, Datadog and Wiz have been installed in the account.
jdno added a commit to jdno/rust-simpleinfra that referenced this issue Jan 29, 2024
We are working on using the logs from our CDNs to count crate downloads
on crates.io. Whenever a log archive is uploaded to the bucket, a
notification is sent to an SQS queue. crates.io then downloads the log,
parses it, and updates the download counts.

For this to work, crates.io needs access to the S3 bucket with the logs.
This change grants read-only access to individual log archives.

See rust-lang#372 for details.
jdno added a commit to jdno/rust-simpleinfra that referenced this issue Jan 31, 2024
The infrastructure to count crate downloads using the CDN logs (see
issue rust-lang#372) has been deployed to production.
@jdno
Member Author

jdno commented Jan 31, 2024

The infrastructure has been created and is ready for testing. I'll leave the issue open until we've confirmed that crates.io can access and process the logs.

@jdno jdno closed this as completed Feb 19, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in infra-team Feb 19, 2024
@Turbo87
Member

Turbo87 commented Feb 19, 2024

for cross-linking purposes: more discussion on this on the crates.io side is at https://rust-lang.zulipchat.com/#narrow/stream/318791-t-crates-io/topic/download.20counting.20via.20CDN.20logs
