
Clouddriver is in charge of caching all of the infrastructure in the accounts it manages. To scale Clouddriver while sharing the caching work across instances, it is recommended to use the Netflix CATS API.

The API allows its consumer to define a number of "caching agents", each identifiable by an ID string. Every 30 seconds, each agent independently tries to acquire a lock under its ID. The agent that gets the lock caches the resources it has declared itself responsible for, and then releases the lock, signifying that no other agent with that ID needs to run until the next cycle.
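The contract described above can be sketched roughly as follows. The interfaces here are hypothetical stand-ins for the real CATS scheduler and agent types, intended only to illustrate the run-one-agent-per-ID behavior:

```java
import java.util.Collection;

// Hypothetical interfaces standing in for the real CATS types, only to show
// the locking contract: one runner per agent ID per 30-second cycle.
interface Agent {
  String getAgentType();     // the ID string the lock is taken under
  void runCachingCycle();    // load and cache the resources this agent owns
}

interface AgentLock {
  boolean tryAcquire(String agentId); // true if this instance won the lock
  void release(String agentId);       // signals no other instance needs to run this agent
}

public class SketchedAgentScheduler {
  private final AgentLock locks;

  public SketchedAgentScheduler(AgentLock locks) {
    this.locks = locks;
  }

  // Called once per 30-second cycle on every Clouddriver instance.
  public void runCycle(Collection<Agent> agents) {
    for (Agent agent : agents) {
      // Only the instance that wins the lock for this agent's ID does the
      // caching work for this cycle.
      if (locks.tryAcquire(agent.getAgentType())) {
        try {
          agent.runCachingCycle();
        } finally {
          locks.release(agent.getAgentType());
        }
      }
    }
  }
}
```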

The agent must also declare, for each resource type (instance, server group, etc.), whether it is INFORMATIVE or AUTHORITATIVE for that type. An agent marked as INFORMATIVE for some type only has visibility into a subset of that type's resources. For example, a load balancer caching agent is only aware of (and only caches) the instances attached to load balancers, so the caching framework should not flush instances from the cache just because the load balancer caching agent no longer reports them: an instance may have been removed from a load balancer even though it still exists. So while a load balancer caching agent is INFORMATIVE for instances, it is AUTHORITATIVE for load balancers, because it reports every load balancer in the account it is associated with, and any load balancer it no longer reports can be flushed from the cache.
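Putting the two paragraphs above together, a load balancer caching agent might look roughly like the sketch below. The CATS type names reflect the API as used in Clouddriver; the provider name, cache key format, namespace strings, and resource data are made up for illustration:

```java
import static com.netflix.spinnaker.cats.agent.AgentDataType.Authority.AUTHORITATIVE;
import static com.netflix.spinnaker.cats.agent.AgentDataType.Authority.INFORMATIVE;

import com.netflix.spinnaker.cats.agent.AgentDataType;
import com.netflix.spinnaker.cats.agent.CacheResult;
import com.netflix.spinnaker.cats.agent.CachingAgent;
import com.netflix.spinnaker.cats.agent.DefaultCacheResult;
import com.netflix.spinnaker.cats.cache.CacheData;
import com.netflix.spinnaker.cats.cache.DefaultCacheData;
import com.netflix.spinnaker.cats.provider.ProviderCache;

import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical load balancer caching agent: AUTHORITATIVE for load balancers
// (anything it stops reporting may be flushed), but only INFORMATIVE for
// instances (it only sees instances attached to a load balancer).
class MyLoadBalancerCachingAgent implements CachingAgent {
  private static final Collection<AgentDataType> PROVIDED_TYPES = Arrays.asList(
      AUTHORITATIVE.forType("loadBalancers"),
      INFORMATIVE.forType("instances")
  );

  private final String accountName;

  MyLoadBalancerCachingAgent(String accountName) {
    this.accountName = accountName;
  }

  @Override
  public String getProviderName() {
    return "myProvider"; // hypothetical provider name
  }

  @Override
  public String getAgentType() {
    // The ID string the 30-second lock is acquired under.
    return accountName + "/MyLoadBalancerCachingAgent";
  }

  @Override
  public Collection<AgentDataType> getProvidedDataTypes() {
    return PROVIDED_TYPES;
  }

  @Override
  public CacheResult loadData(ProviderCache providerCache) {
    // A real agent would call the cloud provider's API here; one hard-coded
    // load balancer with one attached instance stands in for that call.
    Map<String, Object> lbAttributes = new HashMap<>();
    lbAttributes.put("name", "my-lb");

    Map<String, Collection<String>> lbRelationships = new HashMap<>();
    lbRelationships.put("instances",
        Collections.singletonList("myProvider:instances:" + accountName + ":i-123"));

    CacheData loadBalancer = new DefaultCacheData(
        "myProvider:loadBalancers:" + accountName + ":my-lb", lbAttributes, lbRelationships);

    Map<String, Object> instanceAttributes = new HashMap<>();
    instanceAttributes.put("name", "i-123");

    CacheData instance = new DefaultCacheData(
        "myProvider:instances:" + accountName + ":i-123",
        instanceAttributes,
        Collections.<String, Collection<String>>emptyMap());

    Map<String, Collection<CacheData>> cacheResults = new HashMap<>();
    cacheResults.put("loadBalancers", Collections.<CacheData>singletonList(loadBalancer));
    cacheResults.put("instances", Collections.<CacheData>singletonList(instance));
    return new DefaultCacheResult(cacheResults);
  }
}
```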

To allow parts of the cache to be updated on demand (it is nice to see a server group show up shortly after it is created), the API also allows individual resources to be fetched and cached at any time. An example of work done to support safe on-demand cache updates can be found in clouddriver/pull#290. That work guards against the following race condition:

```
             _____________________________________________________________________
            |                                  __                                 |
            |                                 |  |                                |
<-----------+---------------------------------+--+--------------------------------+--------------------->
     loadData() starts and         onDemand(R) is started                  loadData() finishes, and
     retrieves the state of        and immediately after writes            writes the state of resource
     resource R at time t0         the state of resource R at              R at time t0 (Bad!)
                                   time t1
```

Since loadData() is in charge of reading and storing far more data than a single onDemand(R) update, this is a pretty common race condition.
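The snippet below is only a rough illustration of one way to guard against this: record when each on-demand write happened, and have loadData() refuse to overwrite any resource whose on-demand record is newer than the point in time loadData() started reading. The actual fix in Clouddriver is the pull request linked above; the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a timestamp guard that keeps a full loadData() pass from
// clobbering a fresher onDemand() write.
class CacheUpdateGuard {
  // resource id -> time of the last on-demand write for that resource
  private final Map<String, Long> onDemandWriteTimes = new HashMap<>();

  synchronized void recordOnDemandWrite(String resourceId, long writtenAtMillis) {
    onDemandWriteTimes.put(resourceId, writtenAtMillis);
  }

  // Called by loadData() before writing a resource it began fetching at loadStartMillis.
  synchronized boolean shouldWrite(String resourceId, long loadStartMillis) {
    Long onDemandWrite = onDemandWriteTimes.get(resourceId);
    // If an on-demand update wrote this resource after loadData() began
    // fetching, the loadData() copy is stale and must not overwrite it.
    return onDemandWrite == null || onDemandWrite < loadStartMillis;
  }
}
```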