diff --git a/03 - First steps with Kedro.ipynb b/03 - First steps with Kedro.ipynb
new file mode 100644
index 0000000..6a1cd59
--- /dev/null
+++ b/03 - First steps with Kedro.ipynb
@@ -0,0 +1,472 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "source": [
+ "# First steps with Kedro\n",
+ "\n",
+ "\n",
+ "\n",
+ "**Goal**: Create a classifier that predicts whether a flight will be delayed or not, using the [nycflights13 data](https://github.com/hadley/nycflights13)."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": "notes"
+ },
+ "tags": []
+ },
+ "source": [
+ "To see the end result,\n",
+ "\n",
+ "```\n",
+ "$ cd demo/delay-prediction\n",
+ "$ kedro viz run\n",
+ "```\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import ibis\n",
+ "\n",
+ "ibis.options.interactive = True"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## The `DataCatalog`\n",
+ "\n",
+ "Kedro’s [Data Catalog](https://docs.kedro.org/en/latest/data/) is a registry of all data sources available for use by the project. It offers a separate place to declare details of the datasets your projects use. Kedro provides built-in datasets for different file types and file systems so you don’t have to write any of the logic for reading or writing data.\n",
+ "\n",
+ "Kedro offers a range of datasets, including CSV, Excel, Parquet, Feather, HDF5, JSON, Pickle, SQL Tables, SQL Queries, Spark DataFrames, and more. They are supported with the APIs of pandas, spark, networkx, matplotlib, yaml, and beyond. It relies on fsspec to read and save data from a variety of data stores including local file systems, network file systems, cloud object stores, and Hadoop. You can pass arguments in to load and save operations, and use versioning and credentials for data access.\n",
+ "\n",
+ "To start using the Data Catalog, create an instance of the `DataCatalog` class with a dictionary configuration as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from kedro.io import DataCatalog"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "catalog = DataCatalog.from_config(\n",
+ " {\n",
+ " \"flights\": {\n",
+ " \"type\": \"ibis.TableDataset\",\n",
+ " \"table_name\": \"flights\",\n",
+ " \"connection\": {\n",
+ " \"backend\": \"duckdb\",\n",
+ " \"database\": \"nycflights13.ddb\",\n",
+ " \"read_only\": True,\n",
+ " },\n",
+ " }\n",
+ " }\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Each entry in the dictionary represents a **dataset**, and each dataset has a **type** as well as some extra properties. Datasets are Python classes that take care of all the I/O needs in Kedro. In this case, we're using `kedro_datasets.ibis.TableDataset`, you can read [its full documentation](https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.1/api/kedro_datasets.ibis.TableDataset.html) online.\n",
+ "\n",
+ "After the catalog is created, `catalog.list()` will yield a list of the available dataset names, which you can load using the `catalog.load()` method:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "catalog.list()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "flights = catalog.load(\"flights\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice that the resulting object is the exact same Ibis table we were using in the previous tutorial!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "type(flights)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "flights"
+ ]
+ },
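+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Because datasets are ordinary Python classes, the catalog is essentially instantiating `kedro_datasets.ibis.TableDataset` for us. As a rough sketch of what happens under the hood (assuming `kedro-datasets` is installed with the Ibis dataset; illustrative only), you could construct the dataset directly:\n",
+ "\n",
+ "```python\n",
+ "from kedro_datasets.ibis import TableDataset\n",
+ "\n",
+ "# Roughly what the catalog does for the \"flights\" entry (illustrative sketch)\n",
+ "flights_dataset = TableDataset(\n",
+ "    table_name=\"flights\",\n",
+ "    connection={\"backend\": \"duckdb\", \"database\": \"nycflights13.ddb\", \"read_only\": True},\n",
+ ")\n",
+ "flights_dataset.load()  # same Ibis table as catalog.load(\"flights\")\n",
+ "```\n",
+ "\n",
+ "In practice you rarely need to do this by hand: declaring datasets in the catalog keeps I/O configuration out of your code."
+ ]
+ },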
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## The `OmegaConfigLoader`\n",
+ "\n",
+ "Instead of creating the Data Catalog by hand like this, Kedro usually stores configuration in YAML files. To load them, Kedro offers a [configuration loader](https://docs.kedro.org/en/latest/configuration/configuration_basics.html) based on the [Omegaconf](https://omegaconf.readthedocs.io/) library called the `OmegaConfigLoader`. This adds several interesting features, such as\n",
+ "\n",
+ "- Consolidating different configuration files into one\n",
+ "- Substitution, templating\n",
+ "- [Resolvers](https://omegaconf.readthedocs.io/en/2.3_branch/custom_resolvers.html)\n",
+ "- And [much more](https://docs.kedro.org/en/latest/configuration/advanced_configuration.html)\n",
+ "\n",
+ "To start using it, first dump the catalog configuration to a `catalog.yml` file, and then use `OmegaConfigLoader` as follows:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%writefile catalog.yml\n",
+ "flights:\n",
+ " type: ibis.TableDataset\n",
+ " table_name: flights\n",
+ " connection:\n",
+ " backend: duckdb\n",
+ " database: nycflights13.ddb\n",
+ " read_only: true"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from kedro.config import OmegaConfigLoader\n",
+ "\n",
+ "config_loader = OmegaConfigLoader(\n",
+ " conf_source=\".\", # Directory where configuration files are located\n",
+ " config_patterns={\"catalog\": [\"catalog.yml\"]}, # For simplicity for this demo\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "catalog_config = config_loader.get(\"catalog\")\n",
+ "catalog_config"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As you can see, `config_loader.get(\"catalog\")` gets you the same dictionary we crafted by hand earlier.\n",
+ "\n",
+ "However, hardcoding the database path like that seems like an invitation to trouble. Let's declare a variable `_root` inside the YAML file using Omegaconf syntax and load the catalog config again:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%%writefile catalog.yml\n",
+ "_root: /workspaces/kedro-ibis-tutorial\n",
+ "\n",
+ "flights:\n",
+ " type: ibis.TableDataset\n",
+ " table_name: flights\n",
+ " connection:\n",
+ " backend: duckdb\n",
+ " database: \"${_root}/nycflights13.ddb\"\n",
+ " read_only: true"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "catalog_config = config_loader.get(\"catalog\")\n",
+ "catalog_config"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "catalog = DataCatalog.from_config(catalog_config)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "catalog.load(\"flights\")"
+ ]
+ },
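+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick illustration of the resolver feature mentioned at the start of this section, here is a hedged sketch. It assumes a Kedro version whose `OmegaConfigLoader` accepts a `custom_resolvers` argument, and the `today` resolver name is made up for this example.\n",
+ "\n",
+ "```python\n",
+ "import datetime\n",
+ "\n",
+ "from kedro.config import OmegaConfigLoader\n",
+ "\n",
+ "config_loader = OmegaConfigLoader(\n",
+ "    conf_source=\".\",\n",
+ "    config_patterns={\"catalog\": [\"catalog.yml\"]},\n",
+ "    # Hypothetical resolver: \"${today:}\" in a YAML value would expand to today's date\n",
+ "    custom_resolvers={\"today\": lambda: datetime.date.today().isoformat()},\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "A catalog entry could then interpolate `${today:}` inside, say, a file path, keeping hardcoded dates out of your configuration."
+ ]
+ },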
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Nodes and pipelines\n",
+ "\n",
+ "Now comes the interesting part. Kedro structures the computation on Directed Acyclic Graphs (DAGs), which are created by instantiating `Pipeline` objects with a list of `Node`s. By linking the inputs and outpus of each node, Kedro is then able to perform a topological sort and produce a graph.\n",
+ "\n",
+ "Let's start creating a trivial pipeline with 1 node. That 1 node will be a preprocessing function that will manipulate the `dep_time`, `arr_delay`, and `air_time` columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def preprocess_flights(table):\n",
+ " return table.mutate(\n",
+ " dep_time=(\n",
+ " table.dep_time.lpad(4, \"0\").substr(0, 2)\n",
+ " + \":\"\n",
+ " + table.dep_time.substr(-2, 2)\n",
+ " + \":00\"\n",
+ " ).try_cast(\"time\"),\n",
+ " arr_delay=table.arr_delay.try_cast(int),\n",
+ " air_time=table.air_time.try_cast(int),\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "flights.select(\"year\", \"month\", \"day\", \"dep_time\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "preprocess_flights(flights).select(\"year\", \"month\", \"day\", \"dep_time\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice that this is a plain Python function, receiving an Ibis table and returning another Ibis table.\n",
+ "\n",
+ "Now, let's wrap it using the `node` convenience function from Kedro:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from kedro.pipeline import node\n",
+ "\n",
+ "n0 = node(func=preprocess_flights, inputs=\"flights\", outputs=\"preprocessed_flights\")\n",
+ "n0"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Conceptually, a `Node` is a wrapper around a Python function that defines a single step in a pipeline. It has inputs and outputs, which are the names of the Data Catalog datasets that the function will receive and return, respectively. Therefore, you could execute it as follows:\n",
+ "\n",
+ "```python\n",
+ "n0.func(\n",
+ " *[catalog.load(input_dataset) for input_dataset in n0.inputs],\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "Let's not do that though, Kedro will take care of it.\n",
+ "\n",
+ "The next step is to assemble the pipeline. In this case, it will only have 1 node:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from kedro.pipeline import pipeline\n",
+ "\n",
+ "pipe = pipeline([n0])\n",
+ "pipe"
+ ]
+ },
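+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "For reference, nodes are not limited to a single input: `inputs` can be a list of dataset names, and a node can be given an explicit `name`. A hedged sketch (the `join_tables` function below is purely illustrative):\n",
+ "\n",
+ "```python\n",
+ "# Illustrative only: a two-input node joining flights with weather data\n",
+ "node(\n",
+ "    func=join_tables,  # hypothetical function taking two Ibis tables\n",
+ "    inputs=[\"preprocessed_flights\", \"weather\"],\n",
+ "    outputs=\"model_input_table\",\n",
+ "    name=\"create_model_input_table_node\",\n",
+ ")\n",
+ "```\n",
+ "\n",
+ "You will build a node like this in Exercise 2."
+ ]
+ },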
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And finally, you can now execute the pipeline. For the purposes of this tutorial, you can use Kedro's `SequentialRunner` directly:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from kedro.runner import SequentialRunner\n",
+ "\n",
+ "outputs = SequentialRunner().run(pipe, catalog=catalog)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The output of the `.run(...)` method will be \"Any node outputs that cannot be processed by the `DataCatalog`\". Since `preprocessed_flights` is not declared in the Data Catalog, it's right there in the dictionary:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "outputs.keys()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "outputs[\"preprocessed_flights\"]"
+ ]
+ },
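+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If you wanted Kedro to persist `preprocessed_flights` instead of handing it back, you could declare it in `catalog.yml`. The entry below is only a sketch: it assumes the connection is opened without `read_only` and that your version of `ibis.TableDataset` supports a `materialized` save argument.\n",
+ "\n",
+ "```yaml\n",
+ "preprocessed_flights:\n",
+ "  type: ibis.TableDataset\n",
+ "  table_name: preprocessed_flights\n",
+ "  connection:\n",
+ "    backend: duckdb\n",
+ "    database: \"${_root}/nycflights13.ddb\"\n",
+ "  save_args:\n",
+ "    materialized: table\n",
+ "```\n",
+ "\n",
+ "With such an entry in place, the runner would save the table and `.run(...)` would return an empty dictionary for this pipeline."
+ ]
+ },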
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Exercises\n",
+ "\n",
+ "### Exercise 1\n",
+ "\n",
+ "Complete the `catalog.yml` so that `weather` is included as well.\n",
+ "\n",
+ "_Extra points_ if you factor the connection details in a variable."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%load solutions/nb03_ex01_catalog.yml"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "\n",
+ "### Exercise 2\n",
+ "\n",
+ "Complete the data processing pipeline by defining a `create_model_input_table_function`:\n",
+ "\n",
+ "```python\n",
+ "def create_model_input_table(flights, weather) -> ir.Table:\n",
+ " ...\n",
+ "```\n",
+ "\n",
+ "(see the `join` explanation in the Ibis notebook)\n",
+ "\n",
+ "and then recreate the pipeline so that it has two nodes.\n",
+ "\n",
+ "_Extra points_ if your node drops the null values of the resulting table and selects only a subset of the columns."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%load solutions/nb03_ex02.py"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.9"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/solutions/nb03_ex01_catalog.yml b/solutions/nb03_ex01_catalog.yml
new file mode 100644
index 0000000..16c0261
--- /dev/null
+++ b/solutions/nb03_ex01_catalog.yml
@@ -0,0 +1,15 @@
+_root_folder: /workspaces/kedro-ibis-tutorial
+
+_connection:
+ backend: duckdb
+ database: "${_root_folder}/nycflights13.ddb"
+
+flights:
+ type: ibis.TableDataset
+ table_name: flights
+ connection: ${_connection}
+
+weather:
+ type: ibis.TableDataset
+ table_name: weather
+ connection: ${_connection}
diff --git a/solutions/nb03_ex02.py b/solutions/nb03_ex02.py
new file mode 100644
index 0000000..bbd1a90
--- /dev/null
+++ b/solutions/nb03_ex02.py
@@ -0,0 +1,30 @@
+def create_model_input_table(flights, weather):
+    # Binarize the target (delayed by 30 minutes or more), derive the date,
+    # join with the weather observations for the same origin airport and hour,
+    # keep a subset of columns, and drop rows containing nulls.
+    return (
+        flights.mutate(
+            arr_delay=flights.arr_delay >= 30,
+            date=flights.time_hour.date(),
+        )
+        .inner_join(weather, ["origin", "time_hour"])
+        .select(
+            "dep_time",
+            "flight",
+            "origin",
+            "dest",
+            "air_time",
+            "distance",
+            "carrier",
+            "date",
+            "arr_delay",
+            "time_hour",
+        )
+        .dropna()
+    )
+
+
+# Assemble a two-node pipeline: preprocessing followed by the join
+pipe = pipeline([
+    n0,
+    node(
+        func=create_model_input_table,
+        inputs=["preprocessed_flights", "weather"],
+        outputs="model_input_table",
+    ),
+])
diff --git a/static/kedro-final-pipeline.png b/static/kedro-final-pipeline.png
new file mode 100644
index 0000000..26fa13c
Binary files /dev/null and b/static/kedro-final-pipeline.png differ
diff --git a/static/kedro-horizontal-color-on-light.png b/static/kedro-horizontal-color-on-light.png
new file mode 100644
index 0000000..086c746
Binary files /dev/null and b/static/kedro-horizontal-color-on-light.png differ