.. toctree::
   :maxdepth: 2
   :hidden:

   Getting_Started
   Concepts
   Aggregations
   Bootstrap
   Python_API
   Kaggle_Outbrain
   Online_Offline_Consistency
   Code_Guidelines

What is Chronon?

Chronon is a feature engineering framework used to power Machine Learning at Airbnb and Stripe. Chronon aims to make creating production-grade features easy.

With a simple feature definition, Chronon automatically creates infrastructure for generating training data, serving features and monitoring feature quality at scale.

.. image:: ../images/chronon_high_level.png

With Chronon you can:

* Consume data from a variety of sources: event streams, DB table snapshots, change data streams, service endpoints, and warehouse tables modeled as slowly changing dimensions, fact tables, or dimension tables.
* Produce results in both online and offline contexts: online as scalable, low-latency end-points for feature serving, or offline as Hive tables for generating training data.
* Choose real-time or batch accuracy: you can configure the result to be either Temporal or Snapshot accurate. Temporal accuracy means feature values are updated in real time in the online context and are point-in-time correct in the offline context. Snapshot accuracy means features are updated once a day at midnight.
* Backfill training sets from raw data, without waiting months to accumulate feature logs to train your model.
* Use a powerful Python API: data source types, freshness, and contexts are API-level abstractions that you compose with intuitive SQL primitives such as group-by, join, and select, with powerful enhancements.
* Automate feature monitoring: auto-generate monitoring pipelines to understand training data quality, measure training-serving skew, and monitor feature drift.
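The difference between Temporal and Snapshot accuracy can be illustrated with a small sketch in plain Python. This is not Chronon code: the event timestamps, the function names `temporal_count` and `snapshot_count`, and the choice of a simple count are all hypothetical, chosen only to show how the two modes can yield different values for the same query time.

```python
from datetime import datetime

# Hypothetical view events for one user (illustrative data only).
EVENTS = [
    datetime(2024, 1, 1, 22, 30),
    datetime(2024, 1, 2, 9, 15),
    datetime(2024, 1, 2, 13, 45),
]


def temporal_count(events, query_time):
    """Point-in-time view: count every event strictly before query_time."""
    return sum(1 for ts in events if ts < query_time)


def snapshot_count(events, query_time):
    """Daily view: count only events before the most recent midnight."""
    midnight = query_time.replace(hour=0, minute=0, second=0, microsecond=0)
    return sum(1 for ts in events if ts < midnight)


query_time = datetime(2024, 1, 2, 14, 0)
temporal = temporal_count(EVENTS, query_time)
snapshot = snapshot_count(EVENTS, query_time)
print(temporal, snapshot)  # prints "3 1"
```

In this sketch the temporal value reflects the two views made earlier the same day, while the snapshot value only sees what existed at the last midnight.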

Being able to flexibly compose these concepts to describe data processing is what makes feature engineering in Chronon productive.

Example

This is what a simple Chronon Group-By looks like. This definition is used to automatically create offline datasets, feature serving end-points and data quality monitoring pipelines.

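# import paths are assumed from the chronon-ai Python package layout
from ai.chronon import query
from ai.chronon.api.ttypes import EventSource
from ai.chronon.group_by import (
    Accuracy, Aggregation, GroupBy, Operation, TimeUnit, Window,
)

# same definition creates offline datasets and online end-points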
view_features = GroupBy(
   sources=[
       EventSource(
           # apply the transform on offline and streaming data
           table="user_activity.user_views_table",
           topic="user_views_stream",
           query=query.Query(
               # specify any spark sql expression fragments
               # built-in functions, UDFs, arithmetic operations, inline-lambdas, struct types etc.
               selects={
                   "view": "if(context['activity_type'] = 'item_view', 1 , 0)",
               },
               # in Spark SQL, comparing to null with != never evaluates to true
               wheres=["user IS NOT NULL"]
           ))
   ],
   # composite keys
   keys=["user", "item"],
   aggregations=[
       Aggregation(
           operation=Operation.COUNT,
           # automatically explode aggregation list type input columns
           input_column="view",
           # one or more windows for the same input
           windows=[Window(length=5, timeUnit=TimeUnit.HOURS)]),
   ],
   # toggle between fresh vs daily updated features
   accuracy=Accuracy.TEMPORAL,
)
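To make the semantics of the definition above more concrete, here is a minimal plain-Python sketch of the kind of value it describes: a count of item views per `(user, item)` key over a trailing 5-hour window. It does not use Chronon. The rows, the function name `windowed_view_counts`, and the half-open window boundary are hypothetical. The sketch also sums the 0/1 `view` flag so that only item views contribute, which is an assumption about the intent of the example rather than a statement of how `Operation.COUNT` treats zero values.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical raw rows: (user, item, activity_type, timestamp).
ROWS = [
    ("u1", "i1", "item_view", datetime(2024, 1, 1, 8, 0)),
    ("u1", "i1", "item_view", datetime(2024, 1, 1, 11, 0)),
    ("u1", "i1", "add_to_cart", datetime(2024, 1, 1, 11, 30)),
    ("u1", "i2", "item_view", datetime(2024, 1, 1, 5, 0)),
    (None, "i1", "item_view", datetime(2024, 1, 1, 11, 45)),
]


def windowed_view_counts(rows, as_of, window):
    """Sum the view flag per (user, item) for rows in [as_of - window, as_of)."""
    counts = defaultdict(int)
    for user, item, activity_type, ts in rows:
        if user is None:  # mirrors the null-user filter in `wheres`
            continue
        if not (as_of - window <= ts < as_of):  # assumed window boundary
            continue
        # mirrors the `view` select: 1 for item views, 0 otherwise
        view = 1 if activity_type == "item_view" else 0
        counts[(user, item)] += view
    return dict(counts)


result = windowed_view_counts(
    ROWS, as_of=datetime(2024, 1, 1, 12, 0), window=timedelta(hours=5)
)
print(result)  # prints "{('u1', 'i1'): 2}"
```

The row for `("u1", "i2")` falls outside the 5-hour window and the row with a missing user is filtered out, so neither appears in the result.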

Getting Started

If you wish to work in an existing Chronon repo, run the command below.

pip install chronon-ai

If you wish to set up a new Chronon repo, we recommend running the command below inside an Airflow repository for ease of orchestration.

curl -s https://chronon.ai/init.sh | $SHELL

Once you edit the spark_submit_path line in ./chronon/teams.json, you will be able to run offline jobs. Find more details in the Getting Started section.