
Conversation

jgowdy-godaddy (Contributor) commented Aug 3, 2025

Summary

  • Change default cache eviction policy from "simple" (unbounded) to "lru" (bounded)
  • Prevents unbounded memory growth in production systems
  • Maintains cache size limit of 1000 entries by default

The Problem

The current default cache implementation uses simpleCache, which:

  • Never evicts keys
  • Grows without bounds
  • Can cause OOM in production systems with many partitions/users

The code even has a comment warning about this:

// It offers improved performance in scenarios where lockable memory is not a concern.
// Conversely, it is not recommended for memory bound systems, as it does not evict keys.

The Solution

This PR changes the default eviction policy from an empty string (which resolves to "simple") to "lru":

  1. Added constants:

     DefaultKeyCacheEvictionPolicy     = "lru"
     DefaultSessionCacheEvictionPolicy = "slru"

  2. Updated NewCryptoPolicy to set these defaults

  3. Updated documentation to reflect the new defaults
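For reference, here is a minimal sketch of what the new defaults look like. The constant names come from this PR; the CryptoPolicy field names follow those used in the migration guide below and are illustrative, not a verbatim copy of the diff:

package appencryption

// Illustrative sketch only; the real CryptoPolicy contains many more fields.
const (
	DefaultKeyCacheEvictionPolicy     = "lru"
	DefaultSessionCacheEvictionPolicy = "slru"
)

type CryptoPolicy struct {
	SystemKeyCacheEvictionPolicy       string
	IntermediateKeyCacheEvictionPolicy string
	SessionCacheEvictionPolicy         string
	// ...remaining fields elided...
}

// NewCryptoPolicy seeds bounded eviction policies instead of leaving them
// empty, which previously fell through to the unbounded simple cache.
func NewCryptoPolicy() *CryptoPolicy {
	return &CryptoPolicy{
		SystemKeyCacheEvictionPolicy:       DefaultKeyCacheEvictionPolicy,
		IntermediateKeyCacheEvictionPolicy: DefaultKeyCacheEvictionPolicy,
		SessionCacheEvictionPolicy:         DefaultSessionCacheEvictionPolicy,
	}
}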

Why LRU is the Better Default

1. Production Safety

  • In high-traffic web servers and backend systems, unbounded memory growth is a critical operational risk
  • A service handling thousands of unique partition IDs will eventually OOM with unbounded caching
  • Bounded caches provide predictable memory usage, essential for capacity planning and stability

2. Real-World Usage Patterns

  • Most applications have a working set of "hot" keys that are accessed frequently
  • LRU naturally keeps these hot keys in cache while evicting rarely-used ones (see the sketch after this list)
  • The 80/20 rule typically applies: 80% of requests hit 20% of keys
  • A 1000-entry LRU cache captures this working set effectively for most applications
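To make that behavior concrete, here is a small, self-contained Go sketch of LRU eviction. It is not the cache implementation used by appencryption, only an illustration of how a bounded LRU keeps hot keys while dropping cold ones:

package main

import (
	"container/list"
	"fmt"
)

// lruCache is a tiny illustration of bounded LRU eviction.
type lruCache struct {
	capacity int
	order    *list.List               // front = most recently used
	entries  map[string]*list.Element // key -> element whose Value is the key
}

func newLRUCache(capacity int) *lruCache {
	return &lruCache{capacity: capacity, order: list.New(), entries: map[string]*list.Element{}}
}

// Touch records an access: existing keys move to the front, new keys are
// inserted, and the least recently used key is evicted once capacity is hit.
func (c *lruCache) Touch(key string) {
	if el, ok := c.entries[key]; ok {
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.entries, oldest.Value.(string))
	}
	c.entries[key] = c.order.PushFront(key)
}

func main() {
	c := newLRUCache(3)
	for _, k := range []string{"hot-1", "hot-2", "cold-1", "hot-1", "hot-2", "cold-2"} {
		c.Touch(k)
	}
	// The frequently used keys survive; the rarely used "cold-1" was evicted.
	for k := range c.entries {
		fmt.Println("cached:", k)
	}
}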

3. Performance Characteristics

  • LRU with 1000 entries provides excellent hit rates for typical workloads
  • The performance difference between a cache hit and regenerating a key is orders of magnitude
  • Even with evictions, a bounded LRU cache significantly outperforms no caching
  • The overhead of LRU bookkeeping is negligible compared to cryptographic operations

4. Fail-Safe Defaults

  • Libraries should have safe defaults that work well in production
  • Users can always opt into unbounded caching if they have specific requirements
  • It's better to have slightly lower cache hit rates than to crash in production
  • New users are protected from a common pitfall

5. Industry Best Practices

  • Most production caching systems (Redis, Memcached, etc.) default to bounded sizes
  • Unbounded caches are typically considered an anti-pattern
  • Memory leaks from unbounded growth are a common source of production incidents

Impact

  • Before: Caches grow unbounded, potentially causing OOM
  • After: Caches are limited to 1000 entries (configurable) with LRU eviction

Breaking Change

This is technically a breaking change as the default behavior changes from unbounded to bounded caching. However:

  • This is safer for production use
  • Users can still explicitly set evictionPolicy = "simple" if they need unbounded caching
  • The bounded behavior prevents a common production issue

Testing

Added tests to verify (a sketch follows the list):

  • Default policies are set correctly
  • Users can still override to use simple cache if needed
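For illustration, such a test could look roughly like the following (testify assertions, the import path, and the field names from the migration guide are assumed; this is not the exact test added by the PR):

package appencryption_test

import (
	"testing"

	"github.com/stretchr/testify/assert"

	"github.com/godaddy/asherah/go/appencryption"
)

// Verifies the new bounded defaults and that callers can still opt back
// into the unbounded simple cache explicitly.
func TestNewCryptoPolicy_DefaultsToLRUEviction(t *testing.T) {
	policy := appencryption.NewCryptoPolicy()

	assert.Equal(t, "lru", policy.SystemKeyCacheEvictionPolicy)
	assert.Equal(t, "lru", policy.IntermediateKeyCacheEvictionPolicy)

	policy.SystemKeyCacheEvictionPolicy = "simple"
	assert.Equal(t, "simple", policy.SystemKeyCacheEvictionPolicy)
}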

Migration Guide

If you require the old unbounded behavior, explicitly set the eviction policy:

// Opt back into the previous unbounded behavior for both key caches.
policy := appencryption.NewCryptoPolicy()
policy.SystemKeyCacheEvictionPolicy = "simple"
policy.IntermediateKeyCacheEvictionPolicy = "simple"

However, we strongly recommend evaluating whether bounded caching would work for your use case, as it provides better production stability.

Commits

The simple cache implementation never evicts keys, leading to unbounded
memory growth. In production systems with many partitions/users, this
causes OOM issues.

Changes:
- Add DefaultKeyCacheEvictionPolicy constant set to "lru"
- Update NewCryptoPolicy to set eviction policies to LRU by default
- Update documentation to reflect new defaults
- Add tests to verify default behavior

This ensures that by default, caches have a maximum size (1000 entries)
and use LRU eviction when full. Users can still explicitly set "simple"
if they need the old unbounded behavior.

Breaking change: The default cache behavior now evicts old entries when
the cache is full. This is safer for production use but may impact
systems that relied on unbounded caching.

- Fix gofmt formatting in policy.go
- Fix gci import ordering in policy_test.go
- Set default eviction policy to 'lru' in NewCryptoPolicy()
- This prevents unbounded memory growth in production systems
- Maintain backward compatibility - users can still set 'simple' if needed
- Update benchmark tests to use simple cache since they access internals
- Add test to verify new defaults are set correctly