Skip to content

Conversation

@jprenken
Copy link
Contributor

@jprenken jprenken commented Sep 24, 2025

Add a refreshOverrides func to the ratelimits.limitRegistry struct. Instead of populating the static overrides field once when creating an instance of the struct, call the new func at startup and then every 30 minutes.

Emit relevant logs and metrics from limitRegistry.

Add an OverridesFromDB feature flag to read overrides from the DB instead of a file.

Flatten newLimitRegistry.*() methods' logic into their sole caller, NewTransactionBuilder().

Rename loadDefaults() & loadOverrides(), appending FromFile for clarity/consistency.

test: Add ra-sct-provider dependency on SA.

Important for deployment: If the OverridesFromDB feature flag is enabled, starting the RA now depends on the SA. The RA must be added as a gRPC client of sa.StorageAuthorityReadOnly.

Fixes #8382

@jprenken jprenken marked this pull request as ready for review October 22, 2025 06:41
@jprenken jprenken requested a review from a team as a code owner October 22, 2025 06:41
@jprenken jprenken requested a review from jsha October 22, 2025 06:41
@github-actions
Copy link
Contributor

@jprenken, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

@github-actions
Copy link
Contributor

@jprenken, this PR adds one or more new feature flags: OverridesFromDB. As such, this PR must be accompanied by a review of the Let's Encrypt CP/CPS to ensure that our behavior both before and after this flag is flipped is compliant with that document.

Please conduct such a review, then add your findings to the PR description in a paragraph beginning with "CPS Compliance Review:".

Copy link
Contributor

@jsha jsha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some initial style/organizational comments, but I also have a high-level concern. We've talked before about how blocking the process on a database query might become an issue, for instance if we are restarting a datacenter and the RA comes up before the database. It would be nice to not have an ordering dependency on startup; everything can just come up whenever and eventually become healthy. I see there's some code to backoff and retry the SA queries, but I suspect it will max out pretty quickly.

But there's also an issue of how long it takes even in normal operation to load the rate limit overrides. We've generally assumed these are going to be pretty fast, because there's not that many of them. But noticing that the RA is streaming results from the SA made me think again about how the list of overrides is only going to grow forever. Also, the growth rate is somewhat out of our control if we auto approve some requests. So we could wind up where an RA takes several seconds or potentially even minutes to start up, because it's loading overrides.

As a related issue: NewTransactionBuilder() does an SA RPC internally. In general I tend to assume constructor-like things run pretty fast and don't block on external services. How about this?

  • NewTransactionBuilder() returns quickly, with a TransactionBuilder that has no overrides loaded.
  • TransactionBuilder.NewRefresher attempts its first DB load immediately.
  • All DB loads have some amount of backoff and retry on error.
  • TransactionBuilder gets a new method .Ready() that returns true if it has successfully loaded overrides once.
  • The RA's health check return true if:
  • TransactionBuilder.Ready() returns true, OR
  • N seconds have passed since startup.

That last point allows us to avoid serving without overrides data on most reloads, but also ensure that a failure to load the overrides doesn't completely prevent the RA from serving. And since we'll still have the refresher running, the RA gets additional chances to load overrides if the first one fails.


// loadDefaults marshals the defaults YAML file at path into a map of limits.
func loadDefaults(path string) (LimitConfigs, error) {
// loadDefaultsFromFile marshals the defaults YAML file at path into a map of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// loadDefaultsFromFile marshals the defaults YAML file at path into a map of
// loadDefaultsFromFile unmarshals the defaults YAML file at path into a map of

Preexisting issue, but this is unmarshaling, not marshaling. Same comment for loadOverridesFromFile below.

Comment on lines +274 to +280
// hydrateOverrideLimit validates the limit Name, values, and override bucket
// key. It returns the correct bucket key to use in-memory. It should be called
// when loading overrides from the database.
//
// When the OverridesFromDB feature flag is off, parseOverrideLimits is used as
// this method's equivalent.
func hydrateOverrideLimit(bucketKey string, limit *Limit) (string, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid the duplication of code between this function and parseOverrideLimits. Also, the need to explain which function gets called based on the feature flag suggests that something is off in the design and we should be able to separate concerns more.

It seems like this function operates on one bucket key and one limit object, but mostly does the same thing as some of the inner loop in parseOverrideLimits. Can we have that inner loop in parseOverrideLimits just call this?

Reading that inner loop I see that it can't just pass a *Limit because the limit isn't constructed until we have all its fields. That's a nice pattern that we'd like to preserve when we can.

I notice this function only accesses limit.Name from its limit parameter. If we pass bucketKey, limitName string, perhaps that allows us to reuse this code cleanly? It also allows us to use the "don't build until you have all fields"1 pattern in the database code path.

(this would also require moving the ValidateLimit call out into calling code, which I think is the correct move anyhow).

Footnotes

  1. another way to phrase this might be "don't create objects in invalid states".

Comment on lines +389 to +390
// loadOverrides replaces this registry's overrides with a new dataset. This is
// separate from the goroutine in NewRefresher(), for ease of testing.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// loadOverrides replaces this registry's overrides with a new dataset. This is
// separate from the goroutine in NewRefresher(), for ease of testing.
// loadOverrides replaces this registry's overrides with a new dataset.

It's a reasonable and natural choice for NewRefresher to call a function. The comment doesn't need to explain why this function isn't inlined elsewhere, and doing so actually makes it more confusing.


// loadOverrides replaces this registry's overrides with a new dataset. This is
// separate from the goroutine in NewRefresher(), for ease of testing.
func (l *limitRegistry) loadOverrides(ctx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only caller has a //nolint:errcheck, which suggests that returning the error is pointless since nothing uses it.

However, I think this actually suggests a nicer solution: instead of logging from within .loadOverrides, simply return an error (appropriately prefixed), and let the caller log any errors from this function. That means collapsing the distinction between Errf and Warning but I think that's okay. Probably Errf is appropriate for both.

registry, err := newLimitRegistry(defaults, nil)
// NewTransactionBuilder returns a new *TransactionBuilder. A defaults map is
// required.
func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {
func NewTransactionBuilder(defaults LimitConfigs, refresher OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {

overrides is a bit of a misnomer since this is not the overrides themselves but a means to get them.

registry := &limitRegistry{
defaults: regDefaults,
refreshOverrides: overrides,
stats: stats,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unnecessary to store the stats prometheus.Registerer in the limitRegistry object because we've already registered everything we need to.

// NewTransactionBuilder returns a new *TransactionBuilder. A defaults map is
// required.
func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) {
regDefaults, err := parseDefaultLimits(defaults)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-existing issue, but what does reg mean in regDefaults? Looking at existing code, I suspect it's a mistake, since it's adjacent to regOverrides. I'd call this maybe processedDefaults. Or call the input defaultConfigs and this defaults, since that's the name of the field it gets assigned to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

wfe/ra: Periodically load rate limit overrides from the database

2 participants