- Introduction
- Technology Used
- Requirements
- Data Source
- Using this project
- Step 1. Initial setup
- Step 2. Running the workflow
- Alternatively, to run tasks manually (i.e. without using Luigi)
- Optionally, to replicate Luigi's web visualisation
- Step 3. Analysis on Metabase (Optional)
- Step 4. Viewing dbt documentation (Optional)
- Step 5. Done
- Future Features and Improvements
- Author
In this project, we want to run simple analysis on airport and arrival data.
We want to know:
- How many airports are there in Malaysia?
- What are the distances between the airports in Malaysia?
- How many flights are going to land at Malaysian airports in the day following the point of query?
- Which airport is the most congested, based on the information gathered in question 3 above?
To answer these questions, we built an end-to-end framework with ETL pipelines, a database, and an analytics platform, as seen in the Project Framework diagram below. (Admittedly a little over-engineered but, hey! Why not...?)
The relationship between the tables used to answer the questions above can be visualised using the Table Lineage.
We also hosted a version of the dbt documentation on GitHub which would contain the table descriptions, data dictionaries, and the SQL queries—both raw and compiled—used to generate the tables.
Finally, to visualise the data and help us answer the questions, we used Metabase—an open source business intelligence tool—to create an Analytics Dashboard.
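As a taste of question 2, the distance between two airports can be computed from their coordinates with the haversine (great-circle) formula. Below is a minimal standalone Python sketch — it is not the project's actual transformation code, and the coordinates are approximate:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius ~6371 km

# e.g. KUL (Kuala Lumpur) to PEN (Penang), approximate coordinates
print(round(haversine_km(2.7456, 101.7099, 5.2971, 100.2770)))  # ~325 km
```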
| Database | (ETL/ELT) Pipeline | Workflow Orchestration | Analytics Platform |
|---|---|---|---|
| Postgres (using Docker) | Python 3.6<br>dbt (Data Transformation) | Luigi (using Python) | Metabase (using Docker) |
This project was created on macOS. It has not yet been tested on Windows.
- Python 3.6 or above
  - (Optional) Use a virtual environment - see this guide
- IMPORTANT - For libraries used, see `requirements.txt`
- Docker - includes:
  - PostgreSQL - Database
  - Metabase - Analytics Platform
- Other Software
  - DBeaver - SQL Client to view the database
We primarily use Airport and Arrival data from these two sources:
- Airport Data - https://www.flightstats.com/v2/flight-tracker/search
- Arrival Data - https://www.flightstats.com/v2/flight-tracker/search
  - Specifically, we scrape arrival information by looping through this site for each airport
  - Example link scraped: https://www.flightstats.com/v2/flight-tracker/arrivals/KUL/?year=2020&month=2&date=18&hour=6
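As a rough illustration of the scraping approach (the URL pattern comes from the example link above; the function name and the three-airport list are placeholders, and the real scripts also flatten the results to CSV with pandas):

```python
import requests

# Placeholder sample; the project loops over the full list of Malaysian airports
AIRPORTS = ["KUL", "PEN", "BKI"]
URL = ("https://www.flightstats.com/v2/flight-tracker/arrivals/"
       "{code}/?year={y}&month={m}&date={d}&hour={h}")

def fetch_arrivals_page(code, y, m, d, h):
    """Download the raw arrivals page for one airport and one time slot."""
    resp = requests.get(URL.format(code=code, y=y, m=m, d=d, h=h), timeout=30)
    resp.raise_for_status()
    return resp.text  # HTML; the arrival data embedded in the page still needs parsing

for code in AIRPORTS:
    html = fetch_arrivals_page(code, 2020, 2, 18, 6)
    print(code, len(html))
```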
1. Install Python 3.6 using these instructions.
   - (Optional) Use a Python virtual environment
2. Install Poetry using the official documentation.
   - Suggestion: use Homebrew to install
3. IMPORTANT - Install the required libraries using Poetry, at the command line:

   ```
   $ poetry install
   ```

   This will install the Python libraries required for this project in a virtual environment:
   - `luigi` - for (one-script) workflow orchestration
   - `dbt` - for uploading raw data and data transformation in the database
   - `requests` - to scrape websites
   - `SQLAlchemy` - for database connections in scripts
   - `pandas` - for flattening extracted (JSON) data and saving it to CSV
   - `PyYAML` - for configuration extraction
4. Install Docker using these instructions.
5. Install PostgreSQL via Docker using the commands below (source), at the command line:

   ```
   $ docker pull postgres
   $ mkdir -p $HOME/docker/volumes/postgres
   $ docker run --rm --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postgres
   ```
6. Install a suitable SQL client such as DBeaver.
   Use the default connection configuration to verify the database connection and view the data that will be added in the later steps:

   ```
   Host: localhost
   Database: postgres
   Port: 5432
   User: postgres
   Password: docker
   ```
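If you prefer checking the connection from Python instead of a GUI client, here is a minimal sketch using SQLAlchemy with the default configuration above (it assumes a Postgres driver such as `psycopg2-binary` is installed; this script is not part of the project):

```python
from sqlalchemy import create_engine, text

# Same defaults as the DBeaver configuration above
engine = create_engine("postgresql://postgres:docker@localhost:5432/postgres")

with engine.connect() as conn:
    # Any trivial query proves the container is up and reachable
    print(conn.execute(text("SELECT version()")).scalar())
```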
This step utilises the Luigi workflow orchestration that automates the entire ELT pipeline.
To run each task in the workflow manually, see this guide.
1. Activate the virtual environment by running the following command:

   ```
   $ poetry shell
   ```

2. Run the `workflow.py` file, which initiates the Luigi workflow orchestration, at the command line:

   ```
   $ python workflow/workflow.py --local-scheduler DbtRunAnalysis
   ```

   Luigi will work through all the steps as defined in `workflow.py`; a sketch of how such tasks fit together follows this list.
   Note: this step may take some time since it runs the entire pipeline.
   (Optional) Luigi has a web interface which can be used to monitor the workflow and view dependency graphs, as shown below. To replicate this, follow this guide.
3. The following will be returned when the previous step completes successfully:

   ```
   INFO:
   ===== Luigi Execution Summary =====

   Scheduled 7 tasks of which:
   * 7 ran successfully:
       - 1 DbtDeps()
       - 1 DbtRunAirports()
       - 1 DbtRunAnalysis()
       - 1 DbtSeedAirports()
       - 1 DbtSeedArrivals()
       ...

   This progress looks :) because there were no failed tasks or missing dependencies

   ===== Luigi Execution Summary =====
   ```

   Note: you can view the 'logs' for each step in the files with the `.output` extension that are generated in the root of the project directory.

4. The data should be ready in the database as tables and can be viewed using the SQL client that was installed in Step 1 above.
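For reference, a Luigi workflow is a set of task classes chained through `requires()`; Luigi skips any task whose `output()` target already exists. The sketch below is illustrative only — the class names are taken from the execution summary above, but the bodies are simplified stand-ins for the real tasks in `workflow/workflow.py`:

```python
import subprocess

import luigi

class DbtDeps(luigi.Task):
    """Illustrative stand-in for one of the project's dbt tasks."""

    def output(self):
        # Marker file; Luigi treats the task as done once this exists
        return luigi.LocalTarget("dbt_deps.output")

    def run(self):
        result = subprocess.run(["dbt", "deps"], stdout=subprocess.PIPE,
                                universal_newlines=True, check=True)
        with self.output().open("w") as f:
            f.write(result.stdout)

class DbtRunAnalysis(luigi.Task):
    """Runs only after its dependency has produced its output file."""

    def requires(self):
        return DbtDeps()

    def output(self):
        return luigi.LocalTarget("dbt_run_analysis.output")

    def run(self):
        result = subprocess.run(["dbt", "run"], stdout=subprocess.PIPE,
                                universal_newlines=True, check=True)
        with self.output().open("w") as f:
            f.write(result.stdout)
```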
Optional Step! Alternatively, the data can be viewed directly using the SQL client installed in Step 1.
1. Install Metabase via Docker using the command below (source), at the command line:

   ```
   $ docker run -d -p 3000:3000 --name metabase metabase/metabase
   ```

   Wait a little while as it takes time to initialise. You can use `docker logs -f metabase` to follow the rest of the initialisation progress.
Once initialised, visit the local Metabase portal at http://localhost:3000/.
-
You will be directed to
http://localhost:3000/setup
, click onLet's get started
. -
Important step - Use the following database connection configuration to connect Metabase to our Postgres instance in Docker:
Name: Local (Or any other name) Host: host.docker.internal Port: 5432 Database name: postgres Database username: postgres Database password: docker
Please note that this has been tested to work on macOS. For Windows, if Metabase fails to connect to the above
Host
, please try referring to this official documentation on Docker Networking. -
Done! You should see the main Metabase page below which lists the available tables in the Postgres Database that was created in Step 2.
-
Unfortunately, there is no export feature to share the dashboard and charts seen in the Introduction. However, it is just a few simple
SELECT * FROM table
queries to the mainfct_
and some of thestg_
tables available in the Postgres database. 😄
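To rebuild those charts, query the same tables from any client. As a minimal sketch from Python (`fct_arrivals` is a hypothetical name used here for illustration — check the dbt documentation for the actual table names):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:docker@localhost:5432/postgres")

# 'fct_arrivals' is a placeholder; see the dbt docs for the real tables
df = pd.read_sql("SELECT * FROM fct_arrivals", engine)
print(df.head())
```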
Optional Step!
1. In the `./dbt` directory, run the following commands to generate the documentation:

   ```
   $ cd ./dbt
   $ dbt docs generate --profiles-dir ./
   ```
2. Once the documentation has been generated, run the following command to host the documentation website locally:

   ```
   $ dbt docs serve
   ```
3. The documentation website should automatically be launched in your browser. If it does not, just navigate to http://localhost:8080/.
4. Alternatively, you can view the version we hosted on GitHub - dbt documentation.
As this project was completed in roughly three days, some features and improvements were left unimplemented.
Here are a few suggestions:
- Use multiprocessing to scrape websites - especially for scraping arrivals data, as we loop through multiple airports.
- Implement the `logging` module - right now, the scripts use simple `print` statements; proper logging will help with debugging (see the sketch after this list).
- Data documentation using dbt - documentation is important, but it is also a time-consuming task. Whenever possible, we should document the generated tables as thoroughly as we can. Your downstream consumers will thank you.
- Use environment variables for sensitive information - for passwords, in particular.
- Use a more versatile workflow orchestration tool - Luigi is simple and easy to get started with, but one should consider alternatives like Airflow, which can handle complex workflows and has more features.
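A minimal sketch covering two of those suggestions — swapping `print` for the standard `logging` module and reading a password from the environment (the logger name and messages here are illustrative, not from the project):

```python
import logging
import os

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("workflow")

# Sensitive values come from the environment instead of being hard-coded
db_password = os.environ.get("POSTGRES_PASSWORD", "docker")

logger.info("Scraping arrivals for %s", "KUL")  # replaces print(...)
```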
- Lee Boon Keong - Feb, 2020