TTL cache retries more frequently on failures by yizzlez · Pull Request #806 · airbnb/chronon

yizzlez · 2024-08-07T20:30:15Z

Summary

This PR adds a new parameter to the TTL cache: failureTTLMillis. This is a custom TTL for entries of type Failure. By default failureTTLMillis == ttlMillis, so this PR on its own should cause no behavior change. At Stripe, we will be changing some of the caches to have a significantly shorter failureTTLMillis.

Why / Goal

At Stripe, we ran into an incident involving this particular piece of caching code for groupByServing info.

Our internal KV store was returning timeout errors for a handful of requests. The problem is that each such Failure is stored in a TTL cache with a timeout of 2 hours. This caused hosts to enter a bad state and error repeatedly, because they kept fetching the previously stored error from the TTLCache instead of retrying against our KVStore.

With this code change, it's possible to configure this TTL cache to have a failureTTLMillis == 5s, in which case we will automatically retry after 5s if we encounter a KV store timeout error.
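To make the intended behavior concrete, here is a minimal sketch of a cache that expires failures sooner than successes. It is not the Chronon implementation: `SketchTTLCache`, `load`, and `expiryFor` are hypothetical names, and the sketch reloads synchronously on expiry, whereas the real cache also has a refresh interval and an executor.

```scala
import scala.collection.concurrent.TrieMap
import scala.util.{Failure, Success, Try}

// Hypothetical, simplified stand-in for a TTL cache (not the Chronon class).
// Cached values are Try[V], so a failed load is cached as a Failure.
final class SketchTTLCache[K, V](
    load: K => V,
    nowFunc: () => Long,
    ttlMillis: Long = 2 * 60 * 60 * 1000L,       // 2 hours
    failureTTLMillis: Long = 2 * 60 * 60 * 1000L // same as ttlMillis unless overridden
) {
  private case class Entry(value: Try[V], updatedAtMillis: Long)
  private val entries = TrieMap.empty[K, Entry]

  // Failures expire after the shorter of the two TTLs; successes use ttlMillis.
  private def expiryFor(value: Try[V]): Long = value match {
    case Failure(_) => math.min(ttlMillis, failureTTLMillis)
    case Success(_) => ttlMillis
  }

  def apply(key: K): Try[V] = {
    val now = nowFunc()
    entries.get(key) match {
      // Entry is still fresh: serve it, even if it is a cached Failure.
      case Some(e) if now - e.updatedAtMillis <= expiryFor(e.value) => e.value
      // Missing or expired: reload synchronously and cache the outcome.
      case _ =>
        val fresh = Entry(Try(load(key)), now)
        entries.put(key, fresh)
        fresh.value
    }
  }
}
```

With `failureTTLMillis = 5000L`, a cached timeout is served for at most 5 seconds before the next lookup retries the loader, while successful entries keep the full 2 hour TTL.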

Test Plan

Added Unit Tests
Covered by existing CI
Integration tested

Checklist

Documentation update

Reviewers

* Test with current thread executor

nikhilsimha · 2024-08-07T20:57:09Z

-                     refreshIntervalMillis: Long = 8 * 1000 // 8 seconds
+                     refreshIntervalMillis: Long = 8 * 1000, // 8 seconds
+                     // same as ttlMillis, so behavior is unchanged barring an override
+                     failureTTLMillis: Long = 2 * 60 * 60 * 1000 // 2 hours


I think we should lower this to 30 seconds. I am no longer at Airbnb, but I think it would benefit Airbnb too. We have had incidents in the past, similar to yours, where this could have helped.

cc: @pengyu-hou who is familiar with the Airbnb incident.

Hmm, I feel 30 seconds may be too frequent as a default. Some of these TTLCaches are used to cache metadata, which doesn't get updated frequently anyway.

But I can see why, for things like the BatchIr cache, a more frequent failure refresh is desired.

Thanks @nikhilsimha. Our incident was caused by stale but valid metadata. To mitigate it, we would have to flush the TTL cache. This should be addressed by @yuli-han's recent work, after which we will only fetch active configs.

I am curious what failureTTLMillis is on the Stripe side, @yizzlez?

For failure cases, I agree that we should use a lower TTL.

nikhilsimha · 2024-08-07T21:06:36Z

+      val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis)
+      val shouldUpdate = entry.value match {
+        // Encountered a failure, update according to failure TTL.
+        case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL
+        case _ => nowFunc() - entry.updatedAtMillis > intervalMillis
+      }


Suggested change

-      val minFailureUpdateTTL = Math.min(intervalMillis, failureTTLMillis)
-      val shouldUpdate = entry.value match {
-        // Encountered a failure, update according to failure TTL.
-        case Failure(_) => nowFunc() - entry.updatedAtMillis > minFailureUpdateTTL
-        case _ => nowFunc() - entry.updatedAtMillis > intervalMillis
-      }
+      val effectiveExpiry = entry.map(_ => intervalMillis).getOrElse(Math.min(intervalMillis, failureTTLMillis))

minor simplification.
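As a sanity check on this simplification, the sketch below compares the pattern-match form from the diff with the map/getOrElse form from the suggestion. It assumes the suggestion is meant to operate on the cached `Try` (that is, `entry.value`); `ExpirySketch` and both function names are hypothetical and exist only for illustration.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper object; compares two ways of picking the expiry
// for a cached Try value.
object ExpirySketch {
  // Form used in the diff: pattern match on the cached Try.
  def expiryByMatch[A](value: Try[A], intervalMillis: Long, failureTTLMillis: Long): Long =
    value match {
      case Failure(_) => math.min(intervalMillis, failureTTLMillis)
      case _          => intervalMillis
    }

  // Form used in the suggestion, assuming it is applied to the Try itself:
  // a Success maps to intervalMillis, a Failure falls through to getOrElse.
  def expiryByMap[A](value: Try[A], intervalMillis: Long, failureTTLMillis: Long): Long =
    value.map(_ => intervalMillis).getOrElse(math.min(intervalMillis, failureTTLMillis))
}
```

Both forms return `intervalMillis` for a `Success` and `min(intervalMillis, failureTTLMillis)` for a `Failure`, so the refresh condition can then be written once as `nowFunc() - entry.updatedAtMillis > effectiveExpiry`.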

nikhilsimha · 2024-08-07T21:07:11Z

+      }
      if (
-        (nowFunc() - entry.updatedAtMillis > intervalMillis) &&
+        shouldUpdate &&


Suggested change

-        shouldUpdate &&
+        (nowFunc() - entry.updatedAtMillis > effectiveExpiry) &&

nikhilsimha · 2024-08-07T21:07:32Z

@@ -0,0 +1,138 @@
+package ai.chronon.online.test


Thanks a lot for adding this!
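For readers unfamiliar with the "current thread executor" approach named in the commit message, the sketch below shows the general idea. It is not taken from the PR's test file: `CurrentThreadExecutor` and `FakeClock` are hypothetical helpers. An executor that runs tasks on the calling thread makes background refreshes complete before the next assertion, and a controllable clock stands in for `nowFunc`, so TTL expiry can be tested without sleeping.

```scala
import java.util.concurrent.Executor
import java.util.concurrent.atomic.AtomicLong

// Hypothetical test helper: runs every submitted task immediately on the
// calling thread, so "asynchronous" refreshes become deterministic.
object CurrentThreadExecutor extends Executor {
  override def execute(command: Runnable): Unit = command.run()
}

// Hypothetical test helper: a manually advanced clock to pass as nowFunc.
final class FakeClock(startMillis: Long = 0L) {
  private val nowMillis = new AtomicLong(startMillis)

  // Move time forward by the given number of milliseconds.
  def advance(deltaMillis: Long): Unit = { nowMillis.addAndGet(deltaMillis); () }

  // Function value that can be handed to code expecting () => Long.
  val nowFunc: () => Long = () => nowMillis.get()
}
```

A test can then advance the clock just past the failure TTL, perform a lookup, and immediately assert on the refreshed value, because the refresh task has already run by the time `execute` returns.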

nikhilsimha

thanks for the change!

TTL cache retries more frequently on failures

db2126d

* Test with current thread executor

yizzlez force-pushed the yizhao--ttl-failure-cache-changes branch from 8396cae to db2126d on August 7, 2024 20:34

nikhilsimha reviewed Aug 7, 2024

nikhilsimha approved these changes Aug 7, 2024

yizzlez wants to merge 1 commit into main from yizhao--ttl-failure-cache-changes

5 participants
