@sighphyre sighphyre commented Oct 24, 2025

Fixes a few interesting and subtle bugs in the feature refresh code. Previously we always added 5 seconds to the poll interval for fetching features, so a poll interval of 10 seconds was actually 15 seconds. We also never accounted for latency to upstream, so the effective polling interval was closer to poll_interval + latency + 5 seconds.
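To put rough numbers on the description above, here is a minimal sketch of the arithmetic. The function names are hypothetical and the 200 ms latency is only an assumed figure; none of this is code from the PR.

```rust
use std::time::Duration;

/// Hypothetical model of the old behaviour: each cycle costs the configured
/// interval, plus the upstream latency, plus a fixed 5 seconds.
fn old_effective_interval(poll_interval: Duration, latency: Duration) -> Duration {
    poll_interval + latency + Duration::from_secs(5)
}

/// Number of whole polling cycles that fit in `window` when one cycle takes
/// `effective`. `effective` must be at least one millisecond.
fn polls_in(window: Duration, effective: Duration) -> u128 {
    window.as_millis() / effective.as_millis()
}
```

With a 10 second poll interval and an assumed 200 ms of latency, the old behaviour works out to 15.2 seconds per cycle, which is 236 polls per hour instead of the 360 the configuration asks for.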

@sighphyre sighphyre changed the base branch from main to 2-3970 October 24, 2025 14:10
github-actions bot commented Oct 24, 2025

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

pub async fn start_refresh_features_background_task(
    refresher: Arc<FeatureRefresher>,
    refresh_state_rx: Receiver<RefreshState>,
    mut rx: Receiver<RefreshState>,
Member Author

Very small change, but you can just declare your param as mut. It's the same as the mutable shadow that was previously being used here.
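A small standalone illustration of the point, using hypothetical functions rather than the refresher code: both forms give the function body a mutable binding, and callers see the same signature either way.

```rust
/// Takes the parameter immutably, then shadows it with a mutable binding.
fn drain_with_shadow(items: Vec<u32>) -> u32 {
    let mut items = items;
    let mut total = 0;
    while let Some(n) = items.pop() {
        total += n;
    }
    total
}

/// Declares the parameter itself as `mut`; no shadowing line is needed.
fn drain_with_mut_param(mut items: Vec<u32>) -> u32 {
    let mut total = 0;
    while let Some(n) = items.pop() {
        total += n;
    }
    total
}
```

The mut on a by-value parameter only affects the binding inside the function, so switching between the two forms is not a breaking change for callers.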

Member

Can we still call it refresh_state_rx for readability? To know at a glance what it's receiving.

Member Author

Sure, good suggestion!

let failure_count: u32 = min(self.failure_count + 1, 10);
let now = Utc::now();
let next_refresh = calculate_next_refresh(now, *refresh_interval, failure_count as u64);
let base = self.next_refresh.unwrap_or(now);
Member Author

Previously this set the next check time to now + poll interval. Because this is called after we've polled, the poll interval drifted by the latency to upstream.
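The idea behind using the previous deadline as the base can be sketched as follows. This is a simplified stand-in: it uses std's Instant instead of the chrono timestamps in the diff, leaves out the failure-count backoff, and next_deadline is a hypothetical name.

```rust
use std::time::{Duration, Instant};

/// Computes the next deadline from the previously scheduled one when there is
/// one, falling back to `now` only for the very first schedule.
fn next_deadline(previous: Option<Instant>, now: Instant, interval: Duration) -> Instant {
    // Anchoring on the previous deadline keeps the cadence fixed, because the
    // time spent waiting on upstream no longer leaks into the schedule.
    let base = previous.unwrap_or(now);
    base + interval
}
```

When a previous deadline exists, the result does not depend on how late now is, so a slow upstream response no longer pushes every later poll back.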

Member

I think our thought earlier was that this was fine, because 100/200ms delay for a refresh was no biggie.

}

impl TokenRefreshStatus for TokenRefreshSet {
fn next_due_refresh(&self) -> Option<Instant> {
Member Author

Finds the time at which the next token needs to be refreshed. If there's a token that's never been refreshed, it should be refreshed now, although I'm pretty sure that can't happen right now.
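A rough sketch of the rule described above, assuming each token simply carries an optional deadline. The slice of Option<Instant> is a hypothetical simplification of the token set, not the real data structure.

```rust
use std::time::Instant;

/// Returns the earliest moment at which any token is due for refresh.
/// A token with no recorded deadline has never been refreshed, so it is due now.
/// Returns `None` when there are no tokens at all.
fn next_due_refresh(deadlines: &[Option<Instant>], now: Instant) -> Option<Instant> {
    if deadlines.iter().any(|d| d.is_none()) {
        return Some(now);
    }
    // Every entry is `Some` here, so flattening drops nothing.
    deadlines.iter().flatten().copied().min()
}
```

Returning None for an empty set leaves the caller to decide how long to sleep when there is nothing to refresh.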


let due_now = refresher.tokens_to_refresh.get_tokens_due_for_refresh();
if !due_now.is_empty() {
    for token in due_now {
Member Author

This is still icky. I'd honestly like to move this to a Future that's composed of a never-ending stream of token batches to refresh, but that breaks a whole bunch of the backoff code (which, incidentally, would probably be easier to handle in that same stream-based form). That's too big to do in a bugfix PR imo.

Member

Yeah. I think this is a part of Rust/code maturity; what you're describing sounds somewhat similar to what we would've done if we were to write it now. It feels like a signal/effect system wanting to break out... :P

// but our modelling is a little wonky here
.unwrap_or_else(|| Instant::now() + Duration::from_secs(5));

tokio::select! {
Member Author

Now we're in a nice place. We're either sleeping until the next refresh is due or parked waiting to be woken up. If either of these futures completes (we're woken or the timeout expires), we can immediately wake up and do our thang.
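For readers who want the shape of this without tokio, here is a blocking, std-only analogue. It is not the tokio::select! code from the PR: it swaps in std::sync::mpsc and recv_timeout, and the Wake enum and function name are hypothetical.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Instant;

/// Why the wait ended.
#[derive(Debug, PartialEq)]
enum Wake {
    /// The deadline passed without any signal arriving.
    Deadline,
    /// Someone sent a wake-up signal before the deadline.
    Signal,
    /// Every sender was dropped, so no signal can ever arrive.
    Closed,
}

/// Blocks until either `deadline` passes or a signal arrives on `rx`,
/// whichever happens first.
fn wait_for_deadline_or_signal(rx: &Receiver<()>, deadline: Instant) -> Wake {
    // A deadline already in the past saturates to a zero timeout.
    let timeout = deadline.saturating_duration_since(Instant::now());
    match rx.recv_timeout(timeout) {
        Ok(()) => Wake::Signal,
        Err(RecvTimeoutError::Timeout) => Wake::Deadline,
        Err(RecvTimeoutError::Disconnected) => Wake::Closed,
    }
}
```

As with the select in the diff, the caller wakes on whichever event comes first and can then decide what to do from the returned reason.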

    mut rx: Receiver<RefreshState>,
) {
    let mut rx = refresh_state_rx;
    let mut alive = true;
Member

It's probably fine, but we're assuming, right? However if we wanted to be super sure, couldn't we do something like let mut alive = *rx.borrow() == RefreshState::Running;?

Member

Disregard my suggestion, I see now that this has a different meaning. The refresher will always start alive.

if rx.changed().await.is_err() {
    alive = false;
}
continue;
Member

If we're alive and paused, won't this loop right away, indefinitely? Probably okay, but just asking.

Member Author

It should block on this rx.changed().await until we get a change notification
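A blocking, std-only sketch of why the paused branch does not spin. It uses std::sync::mpsc in place of the tokio watch channel, and the two-variant RefreshState here is a hypothetical stand-in (the Paused variant is assumed, not taken from the diff).

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical stand-in for the refresher's state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RefreshState {
    Running,
    Paused,
}

/// Parks the caller while the state is `Paused`.
/// Returns `false` when the sending side has gone away, which plays the role
/// of `alive = false` in the diff.
fn wait_while_paused(rx: &Receiver<RefreshState>, state: &mut RefreshState) -> bool {
    while *state == RefreshState::Paused {
        // `recv` blocks until a new state arrives, so each loop iteration
        // corresponds to one notification rather than a busy spin.
        match rx.recv() {
            Ok(next) => *state = next,
            Err(_) => return false,
        }
    }
    true
}
```

The loop body only runs again after a message arrives or the channel closes, so a long pause costs no CPU.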

Member

@nunogois nunogois left a comment

LGTM

@github-project-automation github-project-automation bot moved this from New to Approved PRs in Issues and PRs Oct 24, 2025
let now = chrono::Utc::now();
self.iter()
    .filter_map(|e| e.value().next_refresh)
    .map(|t| (t - now).to_std().unwrap_or(Duration::from_secs(0)))
Member

why the map to std here?

Member

ah, std implements Ord

Member

hmm, but so does Chrono's TimeDelta.

Member

Not that it matters that much, just curious to learn why you chose to map it.
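One thing the conversion does, whatever the original motivation was: as far as I know, chrono's to_std returns an error for negative deltas, so the unwrap_or turns an overdue deadline into a zero wait. The thread does not confirm that this was the reason. The same clamping can be sketched with std alone, using SystemTime in place of chrono and a hypothetical time_until helper.

```rust
use std::time::{Duration, SystemTime};

/// Time remaining until `deadline`, clamped to zero when it has already passed.
fn time_until(deadline: SystemTime, now: SystemTime) -> Duration {
    // `duration_since` returns an error when `deadline` is earlier than `now`;
    // treat that as "due immediately" rather than as a failure.
    deadline.duration_since(now).unwrap_or(Duration::ZERO)
}
```

A non-negative std Duration is also what std's Instant arithmetic and sleep-style timers accept, so clamping at this point avoids a second conversion later.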

Base automatically changed from 2-3970 to main October 27, 2025 11:16
Member

@chriswk chriswk left a comment

have a 👍 from me as well
