* Test with current thread executor
8396cae to db2126d
| refreshIntervalMillis: Long = 8 * 1000 // 8 seconds | ||
| refreshIntervalMillis: Long = 8 * 1000, // 8 seconds | ||
| // same as ttlMillis, so behavior is unchanged barring an override | ||
| failureTTLMillis: Long = 2 * 60 * 60 * 1000 // 2 hours |
I think we should lower this to 30 seconds. I am no longer at Airbnb, but I think it would benefit Airbnb too: we have had incidents in the past, similar to yours, where this could have helped.
cc: @pengyu-hou who is familiar with the Airbnb incident.
Hmm, I feel 30 seconds may be too frequent as a default. Some of these TTLCache instances are used to cache metadata, which doesn't get updated frequently anyway.
But I can see why for things like BatchIr cache, a more frequent failure refresh is desired.
Thanks @nikhilsimha. Our incident was caused by stale but valid metadata. To mitigate it, we would have had to flush the TTL cache. This should be addressed by @yuli-han's recent work, under which we will only fetch active configs.
I am curious what failureTTLMillis is on the Stripe side, @yizzlez?
For failure cases, I agree that we should use a lower TTL.
| val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis) | ||
| val shouldUpdate = entry.value match { | ||
| // Encountered a failure, update according to failure TTL. | ||
| case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL | ||
| case _ => nowFunc() - entry.updatedAtMillis > intervalMillis | ||
| } |
| val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis) | |
| val shouldUpdate = entry.value match { | |
| // Encountered a failure, update according to failure TTL. | |
| case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL | |
| case _ => nowFunc() - entry.updatedAtMillis > intervalMillis | |
| } | |
| val effectiveExpiry = entry.map(_ => intervalMillis).getOrElse(Math.min(intervalMillis, failureTTLMillis)) |
| } | ||
| if ( | ||
| (nowFunc() - entry.updatedAtMillis > intervalMillis) && | ||
| shouldUpdate && |
| shouldUpdate && | |
| (nowFunc() - entry.updatedAtMillis > effectiveExpiry) && |
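Taken together, the two suggestions above replace the separate interval check and shouldUpdate flag with a single comparison against an effective expiry. A minimal sketch of that decision as a pure function follows. It assumes the cached value is a Try; the Entry class and the ExpirySketch object are hypothetical stand-ins for illustration, not the classes in this PR.

```scala
import scala.util.{Failure, Success, Try}

object ExpirySketch {
  // Hypothetical stand-in for the cache entry; the real Entry class may differ.
  final case class Entry[V](value: Try[V], updatedAtMillis: Long)

  // Successful entries live for intervalMillis. Failed entries live for the
  // shorter of intervalMillis and failureTTLMillis.
  def effectiveExpiry[V](entry: Entry[V], intervalMillis: Long, failureTTLMillis: Long): Long =
    entry.value match {
      case Failure(_) => math.min(intervalMillis, failureTTLMillis)
      case Success(_) => intervalMillis
    }

  // An entry is due for an update once its age exceeds its effective expiry.
  def shouldUpdate[V](entry: Entry[V], nowMillis: Long, intervalMillis: Long, failureTTLMillis: Long): Boolean =
    nowMillis - entry.updatedAtMillis > effectiveExpiry(entry, intervalMillis, failureTTLMillis)
}
```

Because of the min, a failureTTLMillis larger than intervalMillis never extends the lifetime of a failed entry, which matches the "behavior is unchanged barring an override" comment on the default.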
| @@ -0,0 +1,138 @@ | |||
| package ai.chronon.online.test | |||
Thanks a lot for adding this!
nikhilsimha left a comment
thanks for the change!
Summary
This PR adds a new parameter to the TTL cache: failureTTLMillis. This is a custom TTL for entries of type Failure. Currently failureTTLMillis == ttlMillis, so this should cause no behavior change. At Stripe, we will be changing some of the caches to have a significantly shorter failureTTLMillis.
Why / Goal
At Stripe, we ran into an incident involving this particular piece of caching code for groupByServing info. Our internal KV store was returning a handful of timeout errors for some requests. The issue is that this Failure is stored in a TTL cache with a timeout of 2 hours. This caused hosts to enter a bad state and error repeatedly, because they kept fetching the previously stored error from the TTLCache instead of retrying against our KVStore.
With this code change, it's possible to configure this TTL cache to have failureTTLMillis == 5s, in which case we will automatically retry after 5s if we encounter a KV store timeout error.
Test Plan
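One way to check the retry behavior described above is with a fake clock and a loader that fails once. The following is a minimal sketch only and is not the test file added in this PR: SketchCache is a hypothetical, synchronous stand-in for the real TTLCache (which refreshes asynchronously), and the key name and loader are made up for illustration.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

// Hypothetical, synchronous stand-in for the real TTLCache.
final class SketchCache[K, V](
    load: K => Try[V],
    nowFunc: () => Long,
    ttlMillis: Long,
    failureTTLMillis: Long) {

  private case class Entry(value: Try[V], updatedAtMillis: Long)
  private val entries = mutable.Map.empty[K, Entry]

  def apply(key: K): Try[V] = {
    val now = nowFunc()
    // Keep a cached entry only while it is younger than its effective expiry.
    val fresh = entries.get(key).filter { e =>
      val expiry = e.value match {
        case Failure(_) => math.min(ttlMillis, failureTTLMillis)
        case Success(_) => ttlMillis
      }
      now - e.updatedAtMillis <= expiry
    }
    val entry = fresh.getOrElse {
      val reloaded = Entry(load(key), now)
      entries.update(key, reloaded)
      reloaded
    }
    entry.value
  }
}

object SketchCacheDemo {
  // Returns (failure served from cache at 1s, success after 6s, loader calls).
  def run(): (Boolean, Boolean, Int) = {
    var now = 0L
    var calls = 0
    // Fake KV store: the first lookup times out, later lookups succeed.
    val cache = new SketchCache[String, String](
      key => {
        calls += 1
        if (calls == 1) Failure(new RuntimeException("timeout")) else Success(s"info-$key")
      },
      () => now,
      ttlMillis = 2L * 60 * 60 * 1000, // 2 hours
      failureTTLMillis = 5000L         // 5 seconds
    )
    cache("gb")               // t = 0s: the timeout is stored
    now = 1000L
    val second = cache("gb")  // t = 1s: cached failure, no new load
    now = 6000L
    val third = cache("gb")   // t = 6s: failure expired, loader retried
    (second.isFailure, third.isSuccess, calls)
  }
}
```

With failureTTLMillis left equal to ttlMillis, the third lookup in this sketch would still return the stored timeout, which is the bad state described in the incident.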
Checklist
Reviewers