-
-
Notifications
You must be signed in to change notification settings - Fork 629
wfe/ra: Periodically load rate limit overrides from the database #8407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… into NewTransactionBuilder
|
@jprenken, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values. |
|
@jprenken, this PR adds one or more new feature flags: OverridesFromDB. As such, this PR must be accompanied by a review of the Let's Encrypt CP/CPS to ensure that our behavior both before and after this flag is flipped is compliant with that document. Please conduct such a review, then add your findings to the PR description in a paragraph beginning with "CPS Compliance Review:". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had some initial style/organizational comments, but I also have a high-level concern. We've talked before about how blocking the process on a database query might become an issue, for instance if we are restarting a datacenter and the RA comes up before the database. It would be nice to not have an ordering dependency on startup; everything can just come up whenever and eventually become healthy. I see there's some code to backoff and retry the SA queries, but I suspect it will max out pretty quickly.
But there's also an issue of how long it takes even in normal operation to load the rate limit overrides. We've generally assumed these are going to be pretty fast, because there's not that many of them. But noticing that the RA is streaming results from the SA made me think again about how the list of overrides is only going to grow forever. Also, the growth rate is somewhat out of our control if we auto approve some requests. So we could wind up where an RA takes several seconds or potentially even minutes to start up, because it's loading overrides.
As a related issue: NewTransactionBuilder() does an SA RPC internally. In general I tend to assume constructor-like things run pretty fast and don't block on external services. How about this?
NewTransactionBuilder()returns quickly, with aTransactionBuilderthat has no overrides loaded.TransactionBuilder.NewRefresherattempts its first DB load immediately.- All DB loads have some amount of backoff and retry on error.
TransactionBuildergets a new method.Ready()that returns true if it has successfully loaded overrides once.- The RA's health check return true if:
TransactionBuilder.Ready()returns true, OR- N seconds have passed since startup.
That last point allows us to avoid serving without overrides data on most reloads, but also ensure that a failure to load the overrides doesn't completely prevent the RA from serving. And since we'll still have the refresher running, the RA gets additional chances to load overrides if the first one fails.
|
|
||
| // loadDefaults marshals the defaults YAML file at path into a map of limits. | ||
| func loadDefaults(path string) (LimitConfigs, error) { | ||
| // loadDefaultsFromFile marshals the defaults YAML file at path into a map of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // loadDefaultsFromFile marshals the defaults YAML file at path into a map of | |
| // loadDefaultsFromFile unmarshals the defaults YAML file at path into a map of |
Preexisting issue, but this is unmarshaling, not marshaling. Same comment for loadOverridesFromFile below.
| // hydrateOverrideLimit validates the limit Name, values, and override bucket | ||
| // key. It returns the correct bucket key to use in-memory. It should be called | ||
| // when loading overrides from the database. | ||
| // | ||
| // When the OverridesFromDB feature flag is off, parseOverrideLimits is used as | ||
| // this method's equivalent. | ||
| func hydrateOverrideLimit(bucketKey string, limit *Limit) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to avoid the duplication of code between this function and parseOverrideLimits. Also, the need to explain which function gets called based on the feature flag suggests that something is off in the design and we should be able to separate concerns more.
It seems like this function operates on one bucket key and one limit object, but mostly does the same thing as some of the inner loop in parseOverrideLimits. Can we have that inner loop in parseOverrideLimits just call this?
Reading that inner loop I see that it can't just pass a *Limit because the limit isn't constructed until we have all its fields. That's a nice pattern that we'd like to preserve when we can.
I notice this function only accesses limit.Name from its limit parameter. If we pass bucketKey, limitName string, perhaps that allows us to reuse this code cleanly? It also allows us to use the "don't build until you have all fields"1 pattern in the database code path.
(this would also require moving the ValidateLimit call out into calling code, which I think is the correct move anyhow).
Footnotes
-
another way to phrase this might be "don't create objects in invalid states". ↩
| // loadOverrides replaces this registry's overrides with a new dataset. This is | ||
| // separate from the goroutine in NewRefresher(), for ease of testing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // loadOverrides replaces this registry's overrides with a new dataset. This is | |
| // separate from the goroutine in NewRefresher(), for ease of testing. | |
| // loadOverrides replaces this registry's overrides with a new dataset. |
It's a reasonable and natural choice for NewRefresher to call a function. The comment doesn't need to explain why this function isn't inlined elsewhere, and doing so actually makes it more confusing.
|
|
||
| // loadOverrides replaces this registry's overrides with a new dataset. This is | ||
| // separate from the goroutine in NewRefresher(), for ease of testing. | ||
| func (l *limitRegistry) loadOverrides(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only caller has a //nolint:errcheck, which suggests that returning the error is pointless since nothing uses it.
However, I think this actually suggests a nicer solution: instead of logging from within .loadOverrides, simply return an error (appropriately prefixed), and let the caller log any errors from this function. That means collapsing the distinction between Errf and Warning but I think that's okay. Probably Errf is appropriate for both.
| registry, err := newLimitRegistry(defaults, nil) | ||
| // NewTransactionBuilder returns a new *TransactionBuilder. A defaults map is | ||
| // required. | ||
| func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) { | |
| func NewTransactionBuilder(defaults LimitConfigs, refresher OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) { |
overrides is a bit of a misnomer since this is not the overrides themselves but a means to get them.
| registry := &limitRegistry{ | ||
| defaults: regDefaults, | ||
| refreshOverrides: overrides, | ||
| stats: stats, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unnecessary to store the stats prometheus.Registerer in the limitRegistry object because we've already registered everything we need to.
| // NewTransactionBuilder returns a new *TransactionBuilder. A defaults map is | ||
| // required. | ||
| func NewTransactionBuilder(defaults LimitConfigs, overrides OverridesRefresher, stats prometheus.Registerer, logger blog.Logger) (*TransactionBuilder, error) { | ||
| regDefaults, err := parseDefaultLimits(defaults) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pre-existing issue, but what does reg mean in regDefaults? Looking at existing code, I suspect it's a mistake, since it's adjacent to regOverrides. I'd call this maybe processedDefaults. Or call the input defaultConfigs and this defaults, since that's the name of the field it gets assigned to.
Add a
refreshOverridesfunc to theratelimits.limitRegistrystruct. Instead of populating the staticoverridesfield once when creating an instance of the struct, call the new func at startup and then every 30 minutes.Emit relevant logs and metrics from
limitRegistry.Add an
OverridesFromDBfeature flag to read overrides from the DB instead of a file.Flatten
newLimitRegistry.*()methods' logic into their sole caller,NewTransactionBuilder().Rename
loadDefaults()&loadOverrides(), appendingFromFilefor clarity/consistency.test: Add ra-sct-provider dependency on SA.
Important for deployment: If the
OverridesFromDBfeature flag is enabled, starting the RA now depends on the SA. The RA must be added as a gRPC client ofsa.StorageAuthorityReadOnly.Fixes #8382