wfe/ra: Periodically load rate limit overrides from the database #8407

jprenken · 2025-09-24T03:53:53Z

Add a refreshOverrides func to the ratelimits.limitRegistry struct. Instead of populating the static overrides field once when creating an instance of the struct, call the new func at startup and then every 30 minutes.

Emit relevant logs and metrics from limitRegistry.

Add an OverridesFromDB feature flag to read overrides from the DB instead of a file.

Flatten newLimitRegistry.*() methods' logic into their sole caller, NewTransactionBuilder().

Rename loadDefaults() & loadOverrides(), appending FromFile for clarity/consistency.

test: Add ra-sct-provider dependency on SA.

Important for deployment: If the OverridesFromDB feature flag is enabled, starting the RA now depends on the SA. The RA must be added as a gRPC client of sa.StorageAuthorityReadOnly.

Fixes #8382

… into NewTransactionBuilder

…routine

…XMEs

github-actions · 2025-10-22T06:41:45Z

@jprenken, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

github-actions · 2025-10-22T06:41:57Z

@jprenken, this PR adds one or more new feature flags: OverridesFromDB. As such, this PR must be accompanied by a review of the Let's Encrypt CP/CPS to ensure that our behavior both before and after this flag is flipped is compliant with that document.

Please conduct such a review, then add your findings to the PR description in a paragraph beginning with "CPS Compliance Review:".

jsha

I had some initial style/organizational comments, but I also have a high-level concern. We've talked before about how blocking the process on a database query might become an issue, for instance if we are restarting a datacenter and the RA comes up before the database. It would be nice to not have an ordering dependency on startup; everything can just come up whenever and eventually become healthy. I see there's some code to backoff and retry the SA queries, but I suspect it will max out pretty quickly.

But there's also an issue of how long it takes even in normal operation to load the rate limit overrides. We've generally assumed these are going to be pretty fast, because there's not that many of them. But noticing that the RA is streaming results from the SA made me think again about how the list of overrides is only going to grow forever. Also, the growth rate is somewhat out of our control if we auto approve some requests. So we could wind up where an RA takes several seconds or potentially even minutes to start up, because it's loading overrides.

As a related issue: NewTransactionBuilder() does an SA RPC internally. In general I tend to assume constructor-like things run pretty fast and don't block on external services. How about this?

NewTransactionBuilder() returns quickly, with a TransactionBuilder that has no overrides loaded.
TransactionBuilder.NewRefresher attempts its first DB load immediately.
All DB loads have some amount of backoff and retry on error.
TransactionBuilder gets a new method .Ready() that returns true if it has successfully loaded overrides once.
The RA's health check return true if:
TransactionBuilder.Ready() returns true, OR
N seconds have passed since startup.

That last point allows us to avoid serving without overrides data on most reloads, but also ensure that a failure to load the overrides doesn't completely prevent the RA from serving. And since we'll still have the refresher running, the RA gets additional chances to load overrides if the first one fails.

jsha · 2025-10-22T21:18:34Z

ratelimits/limit.go


-// loadDefaults marshals the defaults YAML file at path into a map of limits.
-func loadDefaults(path string) (LimitConfigs, error) {
+// loadDefaultsFromFile marshals the defaults YAML file at path into a map of


Suggested change

// loadDefaultsFromFile marshals the defaults YAML file at path into a map of

// loadDefaultsFromFile unmarshals the defaults YAML file at path into a map of

Preexisting issue, but this is unmarshaling, not marshaling. Same comment for loadOverridesFromFile below.

jsha · 2025-10-22T21:40:27Z

ratelimits/limit.go

+// hydrateOverrideLimit validates the limit Name, values, and override bucket
+// key. It returns the correct bucket key to use in-memory. It should be called
+// when loading overrides from the database.
+//
+// When the OverridesFromDB feature flag is off, parseOverrideLimits is used as
+// this method's equivalent.
+func hydrateOverrideLimit(bucketKey string, limit *Limit) (string, error) {


I'd like to avoid the duplication of code between this function and parseOverrideLimits. Also, the need to explain which function gets called based on the feature flag suggests that something is off in the design and we should be able to separate concerns more.

It seems like this function operates on one bucket key and one limit object, but mostly does the same thing as some of the inner loop in parseOverrideLimits. Can we have that inner loop in parseOverrideLimits just call this?

Reading that inner loop I see that it can't just pass a *Limit because the limit isn't constructed until we have all its fields. That's a nice pattern that we'd like to preserve when we can.

I notice this function only accesses limit.Name from its limit parameter. If we pass bucketKey, limitName string, perhaps that allows us to reuse this code cleanly? It also allows us to use the "don't build until you have all fields"¹ pattern in the database code path.

(this would also require moving the ValidateLimit call out into calling code, which I think is the correct move anyhow).

Footnotes

another way to phrase this might be "don't create objects in invalid states". ↩

jsha · 2025-10-22T21:54:10Z

ratelimits/limit.go

+// loadOverrides replaces this registry's overrides with a new dataset. This is
+// separate from the goroutine in NewRefresher(), for ease of testing.


Suggested change

// loadOverrides replaces this registry's overrides with a new dataset. This is

// separate from the goroutine in NewRefresher(), for ease of testing.

// loadOverrides replaces this registry's overrides with a new dataset.

It's a reasonable and natural choice for NewRefresher to call a function. The comment doesn't need to explain why this function isn't inlined elsewhere, and doing so actually makes it more confusing.

jsha · 2025-10-22T22:55:33Z

ratelimits/limit.go


+// loadOverrides replaces this registry's overrides with a new dataset. This is
+// separate from the goroutine in NewRefresher(), for ease of testing.
+func (l *limitRegistry) loadOverrides(ctx context.Context) error {


The only caller has a //nolint:errcheck, which suggests that returning the error is pointless since nothing uses it.

However, I think this actually suggests a nicer solution: instead of logging from within .loadOverrides, simply return an error (appropriately prefixed), and let the caller log any errors from this function. That means collapsing the distinction between Errf and Warning but I think that's okay. Probably Errf is appropriate for both.

jsha · 2025-10-22T23:10:54Z

ratelimits/transaction.go

-	registry, err := newLimitRegistry(defaults, nil)
+// NewTransactionBuilder returns a new *TransactionBuilder. A defaults map is
+// required.
+func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {


Suggested change

func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {

func NewTransactionBuilder(defaults LimitConfigs, refresher OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {

overrides is a bit of a misnomer since this is not the overrides themselves but a means to get them.

jsha · 2025-10-22T23:12:05Z

ratelimits/transaction.go

+	registry := &limitRegistry{
+		defaults:         regDefaults,
+		refreshOverrides: overrides,
+		stats:            stats,


It's unnecessary to store the stats prometheus.Registerer in the limitRegistry object because we've already registered everything we need to.

jsha · 2025-10-22T23:13:50Z

ratelimits/transaction.go

+// NewTransactionBuilder returns a new *TransactionBuilder. A defaults map is
+// required.
+func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {
+	regDefaults, err := parseDefaultLimits(defaults)


Pre-existing issue, but what does reg mean in regDefaults? Looking at existing code, I suspect it's a mistake, since it's adjacent to regOverrides. I'd call this maybe processedDefaults. Or call the input defaultConfigs and this defaults, since that's the name of the field it gets assigned to.

jprenken added 30 commits September 23, 2025 20:53

wfe/ra: Periodically load rate limit overrides from the database

658e1a5

Fixes #8382

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

fe965f2

Add flags OverridesFromDB & GhostOverrides; collapse newLimitRegistry…

7235514

… into NewTransactionBuilder

Plumb stats & logger into limiter; improve comments

038c083

Use stats & logger; address SA dependency in RA

23f11ac

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

f8b1bf8

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

e6f463c

Remove GhostOverrides; retry overrides instead

56f631c

Change OverridesFromDB from config flag to feature flag

1d5ec0e

Move override refresh timeout to DB only

f44de6d

Change metric to gauge, fix comment & formatting

49ecec2

Add FIXME

f0bfdb6

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

08d8fda

Replicate parseOverrideLimits for loading overrides from DB

bd83d1c

Metrics & logs for override hydration errors

6ca822d

Clarity, concision, pedantry & a hard-boiled egg

a586cda

Add overridesPerLimit metrics

0087e23

Simplify/generalize language, move metric setting out of refresher go…

0f8beee

…routine

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

b40243b

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

70313aa

Change overridesPerLimit to GaugeVec; test loadOverrides; add test FI…

60b6c3b

…XMEs

Precompute overrides at loadOverrides time

bd1aded

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

b7c534c

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

843602a

Extend tests

8edd5d5

TestHydrateOverrideLimit

fb77f44

Refine tests

b09ba07

Fix isOverride; TestNewTransactionBuilderFromDatabase

0c75df4

Merge branch 'main' of github.com:letsencrypt/boulder into rlo-reload

4fd2876

Soothe the linter

83f2ad4

jprenken marked this pull request as ready for review October 22, 2025 06:41

jprenken requested a review from a team as a code owner October 22, 2025 06:41

jprenken requested a review from jsha October 22, 2025 06:41

jsha requested changes Oct 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

wfe/ra: Periodically load rate limit overrides from the database #8407

wfe/ra: Periodically load rate limit overrides from the database #8407

Uh oh!

jprenken commented Sep 24, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Oct 22, 2025

Uh oh!

github-actions bot commented Oct 22, 2025

Uh oh!

jsha left a comment

Uh oh!

jsha Oct 22, 2025

Uh oh!

jsha Oct 22, 2025

Uh oh!

jsha Oct 22, 2025

Uh oh!

jsha Oct 22, 2025

Uh oh!

jsha Oct 22, 2025

Uh oh!

jsha Oct 22, 2025

Uh oh!

jsha Oct 22, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	// loadDefaultsFromFile marshals the defaults YAML file at path into a map of
	// loadDefaultsFromFile unmarshals the defaults YAML file at path into a map of

		// loadOverrides replaces this registry's overrides with a new dataset. This is
		// separate from the goroutine in NewRefresher(), for ease of testing.

	func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {
	func NewTransactionBuilder(defaults LimitConfigs, refresher OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {

Uh oh!

wfe/ra: Periodically load rate limit overrides from the database #8407

Are you sure you want to change the base?

wfe/ra: Periodically load rate limit overrides from the database #8407

Uh oh!

Conversation

jprenken commented Sep 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Oct 22, 2025

Uh oh!

github-actions bot commented Oct 22, 2025

Uh oh!

jsha left a comment

Choose a reason for hiding this comment

Uh oh!

jsha Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

jsha Oct 22, 2025

Choose a reason for hiding this comment

Footnotes

Uh oh!

jsha Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

jsha Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

jsha Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

jsha Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

jsha Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jprenken commented Sep 24, 2025 •

edited

Loading