From f21e9dea59e0cbefc92c7db66bf8f3fbb63b27b6 Mon Sep 17 00:00:00 2001
From: Caio Camatta
Date: Fri, 3 Jan 2025 13:34:31 -0500
Subject: [PATCH 1/3] Small doc changes

---
 docs/source/Tiled_Architecture.md | 79 +++++++++++++------------------
 1 file changed, 32 insertions(+), 47 deletions(-)

diff --git a/docs/source/Tiled_Architecture.md b/docs/source/Tiled_Architecture.md
index a97534e6c..4cfffcce9 100644
--- a/docs/source/Tiled_Architecture.md
+++ b/docs/source/Tiled_Architecture.md
@@ -1,54 +1,45 @@
 # The Tiled Architecture

-**Important**: Tiling is a new feature that is still in the process of being open-sourced.
+**Important**: Tiling is a new feature currently being open-sourced. Feel free to reach out in the Chronon Slack channel if you need support using it. 

-## What is tiling?
+## Tiling Overview

-Tiling, or the tiled architecture, is a modification to Chronon's online architecture to store pre-aggregates (also
-known as "IRs" or Intermediate Representations) in the Key-Value store instead of individual events.
+The tiled architecture is a change to Chronon's online architecture that optimizes data storage and retrieval by storing pre-aggregates (Intermediate Representations, or IRs) in the Key-Value store instead of individual events.

-The primary purpose of tiling is to improve the handling of hot keys, increase scalability, and decrease feature serving
-latency.
+Tiling is primarily designed to improve the handling of hot keys, increase system scalability, and decrease feature serving latency.

-Tiling requires [Flink](https://flink.apache.org/).
+Tiling requires [Flink](https://flink.apache.org/) for stream processing.

-### Chronon without tiling
+### Traditional Chronon Architecture

-The regular, untiled version works as pictured in Figure 1.
+The regular, untiled architecture (Figure 1) works the following way:

-- The "write" path: reads an event stream, processes the events in Spark, then writes them out to a datastore.
-- The "read" path: reads O(events) events from the store, aggregates them, and returns the feature values to the user.
+- Write path: ingests an event stream, processes the events, and writes them out individually to a datastore.
+- Read path: retrieves O(events) from the store, aggregates them, and returns feature values to the client.

 ![Architecture](../images/Untiled_Architecture.png)
-_Figure 1: The untiled architecture_
+_Figure 1: Regular, untiled Chronon architecture_

-At scale, aggregating O(n) events each time there is a request can become costly. For example, if you have an event
-stream producing 10 events/sec for a certain key, a request for a feature with a 12-hour window will have to fetch and
-aggregate 432,000 events every single time. For a simple GroupBy that counts the number of events for a key, Chronon
-would iterate over 432,000 items and count the total.
+At scale, aggregating O(n) events per request can become computationally expensive. For example, with an event stream generating 10 events/sec for a specific key, each feature request with a 12-hour window requires fetching and aggregating 432,000 events.For a simple GroupBy that counts the number of events for a key, Chronon
+would iterate over 432,000 items to compute the total.
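+
+To make that cost concrete, here is a minimal sketch of the untiled read path. This is illustrative Python, not Chronon's actual implementation, and the function names are hypothetical:
+
+```python
+# Illustrative only: the untiled read path re-aggregates raw events per request.
+EVENTS_PER_SEC = 10
+WINDOW_HOURS = 12
+
+def fetch_events(key, window_hours):
+    """Stand-in for a KV store read that returns one row per raw event."""
+    return [1] * (EVENTS_PER_SEC * 60 * 60 * window_hours)
+
+def serve_count_feature(key):
+    events = fetch_events(key, WINDOW_HOURS)  # fetches 432,000 rows
+    return sum(events)                        # iterates over all 432,000 of them
+
+print(serve_count_feature("some_hot_key"))  # 432000
+```
+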
-### Chronon with tiling
+### Tiled Chronon Architecture

-The tiled architecture, depicted in Figure 2, works differently:
+The tiled architecture (Figure 2) operates as follows:

-- The "write" path: reads an event stream, processes and pre-aggregates the events in a stateful Flink app, then writes
-  out the pre-aggregates to "tiles" in the store.
-- The "read" path: reads O(tiles) tiles from the store, merges the pre-aggregates, and returns the feature values to the
-  user.
+- Write path: ingests an event stream, processes and pre-aggregates events using a stateful Flink app, and writes out pre-aggregates as "tiles" in the store.
+- Read path: retrieves O(tiles) tiles from the store, merges pre-aggregates, and returns feature values to the client.

 ![Architecture](../images/Tiled_Architecture.png)
-_Figure 2: The tiled architecture_
+_Figure 2: Tiled Chronon architecture_

-Tiling shifts a significant part of the aggregation work to the write path, which allows for faster feature serving.
+This architecture shifts a significant part of aggregation work to the write path, allowing faster feature serving.

-Using the same example as above (an event stream producing 10 events/sec for a certain key, and a GroupBy with a 12-hour
-window), a request for feature values would fetch and merge 12 or 13 1-hour tiles. For a simple GroupBy that counts the
-number of events for a key, Chronon would iterate over 13 numbers and add them together. That's significantly less work.
+Using the same example as above (10 events/sec, 12-hour window GroupBy), a request for feature values in this architecture would only fetch and merge 12 or 13 1-hour tiles. For a simple count-based GroupBy, Chronon would iterate over 13 pre-computed sums and add them together. That's significantly less work.

 #### Example: Fetching and serving tiled features

-Suppose you have a GroupBy with two aggregations, `COUNT` and `LAST`, both using 3-hour windows, and you are storing
-1-hour tiles in KV Store. To serve them, the Chronon Fetcher would fetch three tiles:
+Consider a GroupBy with `COUNT` and `LAST` aggregations using 3-hour windows, with 1-hour tiles in the KV Store. To serve these features, the Chronon Fetcher would retrieve three tiles:

 ```
 [0:00, 1:00) -> [2, "B"]
@@ -56,29 +47,23 @@ Suppose you have a GroupBy with two aggregations, `COUNT` and `LAST`, both using
 [2:00, 3:00) -> [3, "C"]
 ```

-Then, it would combine the IRs to get the final feature values: `[14, "C"]`.
+The IRs would then be combined to produce the final feature values: `[14, "C"]`.

-## When to use tiling
+## Tiling Use Cases

-In general, tiling improves scalability and decreases feature serving latency. Some use cases are:
+Tiling is particularly beneficial for:

-- You want to decrease feature serving latency. At Stripe, migrating to tiling decreased serving latency by 33% at 4K
-  rps.
-- You don't have access to a datastore with range queries
-- You want to reduce fanout to your datastore.
-- You need to support aggregating over hot key entities
+1. Reducing feature serving latency at scale (Stripe saw a 33% reduction after the initial implementation).
+3. Minimizing datastore load, as fetching O(tiles) is generally less work than fetching O(events).
+4. Supporting aggregations over very hot key entities.
+2. Organizations that don't have access to a datastore with range query capabilities (such as Cassandra).

-In particular, organizations operating at significant scale with many hot-key entities should consider using the tiled
-architecture. If the number of events per entity key is at most a few thousand, the untiled approach would still perform
-well.
+Organizations operating at scale with many hot-key entities (such as big merchants in payments companies) should consider using the tiled architecture. If your hottest keys don't exceed a few thousand events per day, the untiled approach may still be sufficient.

 ## How to enable tiling

-To enable tiling, you first need to start using Flink on the write path. See
-the [Chronon on Flink documentation](setup/Flink.md) for instructions. As part of this process, you may also need to
-modify your KV store implementation to know how to write and fetch tiles.
+To enable tiling:

-Once the Flink app is set up and writing tiles to your datastore, the final step is to enable tiled reads in the
-Fetcher. Just add `enable_tiling=true` to
-the [customJson](https://github.com/airbnb/chronon/blob/48b789dd2c216c62bbf1d74fbf4e779f23db541f/api/py/ai/chronon/group_by.py#L561)
-of any GroupBy definition.
+1. Implement Flink on the write path (refer to the [Chronon on Flink documentation](setup/Flink.md)). As part of this process, you may also need to
+   modify your KV store implementation to support writing and and fetching tiles.
+2. Once your Flink app is writing tiles to the datastore, enable tiled reads in the Fetcher by adding `enable_tiling=true` to the [customJson](https://github.com/airbnb/chronon/blob/48b789dd2c216c62bbf1d74fbf4e779f23db541f/api/py/ai/chronon/group_by.py#L561) of any GroupBy definition.
\ No newline at end of file

From 4a30306ba470f07b6257a155f7a185e581c81f71 Mon Sep 17 00:00:00 2001
From: Caio Camatta
Date: Fri, 3 Jan 2025 13:40:29 -0500
Subject: [PATCH 2/3] Small doc changes

---
 docs/source/Tiled_Architecture.md | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/docs/source/Tiled_Architecture.md b/docs/source/Tiled_Architecture.md
index 4cfffcce9..76dd70e10 100644
--- a/docs/source/Tiled_Architecture.md
+++ b/docs/source/Tiled_Architecture.md
@@ -1,6 +1,6 @@
 # The Tiled Architecture

-**Important**: Tiling is a new feature currently being open-sourced. Feel free to reach out in the Chronon Slack channel if you need support using it. 
+**Important**: Tiling is a new feature currently being open-sourced. Feel free to reach out in the Chronon Slack channel if you need support using it.

 ## Tiling Overview

@@ -20,8 +20,7 @@ The regular, untiled architecture (Figure 1) works the following way:
 ![Architecture](../images/Untiled_Architecture.png)
 _Figure 1: Regular, untiled Chronon architecture_

-At scale, aggregating O(n) events per request can become computationally expensive. For example, with an event stream generating 10 events/sec for a specific key, each feature request with a 12-hour window requires fetching and aggregating 432,000 events.For a simple GroupBy that counts the number of events for a key, Chronon
-would iterate over 432,000 items to compute the total.
+At scale, aggregating O(n) events per request can become computationally expensive. For example, with an event stream generating 10 events/sec for a specific key, each feature request with a 12-hour window requires fetching and aggregating 432,000 events. For a simple GroupBy that counts the number of events for a key, Chronon would iterate over 432,000 items to compute the total.

 ### Tiled Chronon Architecture

@@ -54,9 +53,9 @@ The IRs would then be combined to produce the final feature values: `[14, "C"]`.
 Tiling is particularly beneficial for:

 1. Reducing feature serving latency at scale (Stripe saw a 33% reduction after the initial implementation).
-3. Minimizing datastore load, as fetching O(tiles) is generally less work than fetching O(events).
-4. Supporting aggregations over very hot key entities.
-2. Organizations that don't have access to a datastore with range query capabilities (such as Cassandra).
+2. Minimizing datastore load, as fetching O(tiles) is generally less work than fetching O(events).
+3. Supporting aggregations over very hot key entities.
+4. Organizations that don't have access to a datastore with range query capabilities (such as Cassandra). Tiling allows efficient retrieval of aggregated data without the need for complex range queries.

 Organizations operating at scale with many hot-key entities (such as big merchants in payments companies) should consider using the tiled architecture. If your hottest keys don't exceed a few thousand events per day, the untiled approach may still be sufficient.

@@ -64,6 +63,5 @@

 To enable tiling:

-1. Implement Flink on the write path (refer to the [Chronon on Flink documentation](setup/Flink.md)). As part of this process, you may also need to
-   modify your KV store implementation to support writing and and fetching tiles.
+1. Implement Flink on the write path (refer to the [Chronon on Flink documentation](setup/Flink.md)). As part of this process, you may also need to modify your KV store implementation to support writing and fetching tiles.
 2. Once your Flink app is writing tiles to the datastore, enable tiled reads in the Fetcher by adding `enable_tiling=true` to the [customJson](https://github.com/airbnb/chronon/blob/48b789dd2c216c62bbf1d74fbf4e779f23db541f/api/py/ai/chronon/group_by.py#L561) of any GroupBy definition.
\ No newline at end of file

From b59fb8dd7bc8e78bd6e0e36874ad1befb4a390cc Mon Sep 17 00:00:00 2001
From: Caio Camatta
Date: Fri, 3 Jan 2025 13:42:08 -0500
Subject: [PATCH 3/3] Clarify flink can be used for untiled

---
 docs/source/Tiled_Architecture.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/Tiled_Architecture.md b/docs/source/Tiled_Architecture.md
index 76dd70e10..fc0ce9650 100644
--- a/docs/source/Tiled_Architecture.md
+++ b/docs/source/Tiled_Architecture.md
@@ -14,7 +14,7 @@ Tiling requires [Flink](https://flink.apache.org/) for stream processing.
 The regular, untiled architecture (Figure 1) works the following way:

-- Write path: ingests an event stream, processes the events, and writes them out individually to a datastore.
+- Write path: ingests an event stream, processes the events using either Spark Streaming or Flink, and writes them out individually to a datastore.
 - Read path: retrieves O(events) from the store, aggregates them, and returns feature values to the client.

 ![Architecture](../images/Untiled_Architecture.png)
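
To make the `COUNT`/`LAST` tile example from the first patch concrete, here is a minimal sketch of tile pre-aggregation and merging. This is illustrative Python rather than the actual Flink/Fetcher implementation, and it assumes each tile's IR for this GroupBy is simply a `[count, last_value]` pair:

```python
# Illustrative only: 1-hour tiles for a GroupBy with COUNT and LAST aggregations.
TILE_MS = 60 * 60 * 1000  # tile size: 1 hour, in milliseconds

def tile_start(event_ts_ms):
    """Write path: assign an event to the tile covering its timestamp."""
    return event_ts_ms - (event_ts_ms % TILE_MS)

def update_ir(ir, event_value):
    """Write path: fold one event into a tile's [count, last_value] IR."""
    count, _ = ir
    return [count + 1, event_value]

def merge_irs(tile_irs):
    """Read path: combine time-ordered tile IRs into final feature values."""
    total = sum(count for count, _ in tile_irs)  # COUNT IRs are summed
    last = tile_irs[-1][1]                       # LAST keeps the newest value
    return [total, last]

# Write path: three events land in the [2:00, 3:00) tile, producing its IR.
ir = [0, None]
for ts_ms, value in [(7_205_000, "X"), (7_210_000, "Y"), (7_215_000, "C")]:
    assert tile_start(ts_ms) == 7_200_000  # 2:00 expressed in milliseconds
    ir = update_ir(ir, value)
print(ir)  # [3, 'C'], the third tile from the example above

# Read path: merge the three tiles retrieved by the Fetcher.
print(merge_irs([[2, "B"], [9, "A"], [3, "C"]]))  # [14, 'C']
```

Because IR merging is associative here, the read path does a handful of merges regardless of how many raw events each tile absorbed on the write path.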
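Step 2 of the how-to can be sketched as follows. The exact mechanism by which `enable_tiling` reaches customJson depends on your Chronon version, so treat the keyword-argument wiring below as an assumption and check the linked `group_by.py` for the authoritative behavior; the source and aggregation placeholders are hypothetical:

```python
# Hypothetical sketch of step 2: flagging a GroupBy for tiled reads.
# Assumption: extra keyword arguments passed to GroupBy are serialized into
# the GroupBy's customJson, where the Fetcher looks for enable_tiling.
from ai.chronon.group_by import GroupBy

merchant_txn_counts = GroupBy(
    sources=[txn_source],           # placeholder: an event source defined elsewhere
    keys=["merchant_id"],           # placeholder entity key
    aggregations=txn_aggregations,  # placeholder: COUNT/LAST aggregations, etc.
    online=True,
    enable_tiling=True,             # lands in customJson; enables tiled reads
)
```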