.. toctree::
   :maxdepth: 2
   :hidden:

   Getting_Started
   Concepts
   Aggregations
   Bootstrap
   Python_API
   Kaggle_Outbrain
   Online_Offline_Consistency
   Code_Guidelines
Chronon is a feature engineering framework used to power Machine Learning at Airbnb and Stripe. Chronon aims to make creating production-grade features easy.
With a simple feature definition, Chronon automatically creates infrastructure for generating training data, serving features and monitoring feature quality at scale.
With Chronon you can:

* Consume data from a variety of sources - event streams, DB table snapshots, change data streams, service endpoints, and warehouse tables modeled as slowly changing dimensions, fact tables, or dimension tables.
* Produce results in both online and offline contexts - online as scalable, low-latency endpoints for feature serving, or offline as Hive tables for generating training data.
* Choose real-time or batch accuracy - you can configure results to be either temporally or snapshot accurate. Temporal accuracy updates feature values in real time in the online context and produces point-in-time correct features in the offline context; snapshot accuracy updates features once a day at midnight.
* Backfill training sets from raw data - without having to wait months to accumulate feature logs to train your model.
* Use a powerful Python API - data source types, freshness, and contexts are API-level abstractions that you compose with intuitive SQL primitives like group-by, join, and select, with powerful enhancements.
* Get automated feature monitoring - auto-generated monitoring pipelines measure training data quality, training-serving skew, and feature drift.
Being able to flexibly compose these concepts to describe data processing is what makes feature engineering in Chronon productive.
This is what a simple Chronon GroupBy looks like. This definition is used to automatically create offline datasets, feature-serving endpoints, and data quality monitoring pipelines:
.. code-block:: python

    # import paths follow the Chronon Python API (ai.chronon);
    # verify them against your installed version
    from ai.chronon import query
    from ai.chronon.api.ttypes import Accuracy, EventSource
    from ai.chronon.group_by import (
        Aggregation,
        GroupBy,
        Operation,
        TimeUnit,
        Window,
    )

    # the same definition creates offline datasets and online end-points
    view_features = GroupBy(
        sources=[
            EventSource(
                # apply the transform on both offline and streaming data
                table="user_activity.user_views_table",
                topic="user_views_stream",
                query=query.Query(
                    # specify any Spark SQL expression fragments:
                    # built-in functions, UDFs, arithmetic operations,
                    # inline lambdas, struct types etc.
                    selects={
                        "view": "if(context['activity_type'] = 'item_view', 1, 0)",
                    },
                    # note: "user != null" never matches in Spark SQL;
                    # null checks need IS NOT NULL
                    wheres=["user IS NOT NULL"],
                ),
            ),
        ],
        # composite keys
        keys=["user", "item"],
        aggregations=[
            Aggregation(
                # list-type input columns are automatically exploded
                operation=Operation.COUNT,
                input_column="view",
                # multiple windows can be computed over the same input
                windows=[Window(length=5, timeUnit=TimeUnit.HOURS)],
            ),
        ],
        # toggle between fresh (TEMPORAL) and daily-updated (SNAPSHOT) features
        accuracy=Accuracy.TEMPORAL,
    )
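Definitions like this compose. Below is a minimal sketch of how a GroupBy plugs into a training set backfill, assuming the ``Join`` and ``JoinPart`` helpers from the Chronon Python API; the checkout event source on the left side and its table and column names are hypothetical, not taken from this page:

.. code-block:: python

    # a sketch only: table and column names below are illustrative
    from ai.chronon.api.ttypes import EventSource, Source
    from ai.chronon.join import Join, JoinPart
    from ai.chronon.query import Query, select

    training_set = Join(
        # the left side supplies the rows (and timestamps) to enrich,
        # e.g. one row per checkout event
        left=Source(events=EventSource(
            table="user_activity.checkouts_table",  # hypothetical table
            query=Query(selects=select("user", "item")),
        )),
        # each JoinPart attaches a GroupBy's features to those rows,
        # point-in-time correct when the GroupBy is TEMPORAL
        right_parts=[JoinPart(group_by=view_features)],
    )

Backfilling such a Join yields a point-in-time correct training table offline, while the same definitions power the online feature-serving endpoints.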
If you wish to work in an existing Chronon repo, simply run:

.. code-block:: shell

    pip install chronon-ai
If you wish to set up a new Chronon repo, we recommend running the command below in an Airflow repository, for ease of orchestration:

.. code-block:: shell

    curl -s https://chronon.ai/init.sh | $SHELL
Once you edit the ``spark_submit_path`` line in ``./chronon/teams.json``, you will be able to run offline jobs.
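As a rough sketch of what that looks like - the team and file names here are hypothetical, and the exact script names and flags should be verified against your Chronon version:

.. code-block:: shell

    # compile the Python definition into a serialized config
    # (group_bys/my_team/view_features.py is a hypothetical path)
    compile.py --conf=group_bys/my_team/view_features.py

    # backfill the features offline up to a given end date
    run.py --mode=backfill \
        --conf=production/group_bys/my_team/view_features.v1 \
        --ds=2023-04-09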
Find more details in the Getting Started section.