Skip to content

Latest commit

 

History

History
283 lines (194 loc) · 16.3 KB

File metadata and controls

283 lines (194 loc) · 16.3 KB

Build and Run ETL Data Pipeline and BI with Luigi and Metabase on Alibaba Cloud

You can access the tutorial artifact including deployment script (Terraform), related source code, sample data and instruction guidance from the github project: https://github.com/alibabacloud-howto/opensource_with_apsaradb/tree/main/luigi_metabase

More tutorial around Alibaba Cloud Database, please refer to: https://github.com/alibabacloud-howto/database


Overview

Luigi (https://github.com/spotify/luigi) is a Python (3.6, 3.7, 3.8, 3.9 tested) package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more. Document reference: https://luigi.readthedocs.io/en/stable/.

Metabase (https://www.metabase.com/) is an open source business intelligence tool. It lets user ask questions about the data, and displays answers in formats that make sense, whether that’s a bar graph or a detailed table. Metabase uses the default application database (H2) when initially start using Metabase. To enhance with the database high availability behind the Metabase BI Server, we will use Alibaba Cloud Database RDS PostgreSQL as the backend database of Metabase. Metabase supports either PostgreSQL or MySQL as the backend database. On Alibaba Cloud, You can either use one of the following databases:

In this tutorial, we will show the case of using RDS PostgreSQL high availability edition for more stable production purpose.

Deployment architecture:

image.png


Index


Step 1. Use Terraform to provision ECS and database on Alibaba Cloud

If you are the 1st time to use Terraform, please refer to https://github.com/alibabacloud-howto/terraform-templates to learn how to install and use the Terraform on different operating systems.

Run the terraform script to initialize the resources (in this tutorial, we use RDS PostgreSQL as backend database of Metabase and a RDS PostgreSQL as the demo database showing the ETL data pipeline via Luigi task and BI in Metabase, so ECS and 2 RDS PostgreSQL instances are included in the Terraform script). Please specify the necessary information and region to deploy.

image.png

After the Terraform script execution finished, the ECS and RDS PostgreSQL instances information are listed as below.

image.png

  • eip_ecs: The public EIP of the ECS for Azkaban installation host
  • rds_pg_url_metabase_database: The connection URL of the backend RDS PostgreSQL database for Metabase
  • rds_pg_port_metabase_database: The connection port of the backend RDS PostgreSQL database for Metabase, by default, it is 5432 for RDS PostgreSQL
  • rds_pg_url_demo_database: The connection URL of the demo RDS PostgreSQL database using Luigi and Metabase
  • rds_pg_port_demo_database: The connection Port of the demo RDS PostgreSQL database using Luigi and Metabase, by default, it is 5432 for RDS PostgreSQL

Step 2. Deploy and setup Luigi and Metabase on ECS with RDS PostgreSQL

Please log on to ECS with ECS EIP. By default, the password is N1cetest, which is preset in the terraform provision script in Step 1. If you've already changed it, please update accordingly.

ssh root@<ECS_EIP>

image.png

Execute the following command to install gcc, python, related python modules, Luigi, JDK 8, Git and PostgreSQL client.

yum install -y gcc-c++*
yum install -y python39
yum install -y postgresql-devel
pip3 install psycopg2
pip3 install pandas
pip3 install mlxtend
pip3 install pycountry
pip3 install luigi

yum install -y java-1.8.0-openjdk-devel.x86_64
yum install -y git

cd ~
wget http://mirror.centos.org/centos/8/AppStream/x86_64/os/Packages/compat-openssl10-1.0.2o-3.el8.x86_64.rpm
rpm -i compat-openssl10-1.0.2o-3.el8.x86_64.rpm
wget http://docs-aliyun.cn-hangzhou.oss.aliyun-inc.com/assets/attach/181125/cn_zh/1598426198114/adbpg_client_package.el7.x86_64.tar.gz
tar -xzvf adbpg_client_package.el7.x86_64.tar.gz

image.png

Execute the commands to checkout the project files from Github and navigate to the project directory.

git clone https://github.com/alibabacloud-howto/opensource_with_apsaradb.git
cd opensource_with_apsaradb/luigi_metabase/

image.png

In this tutorial, I show the Metabase execution approach via running the Metabase JAR file. So please execute the following commands to download the Metabase JAR file.

cd ~/opensource_with_apsaradb/luigi_metabase/metabase
wget https://downloads.metabase.com/v0.40.3.1/metabase.jar

By default, Metabase uses the default application database (H2) when initially start using Metabase. But in this tutorial, I show the best practice of switching to a more production-ready database RDS PostgreSQL. Basically, it follows the official document Migrating from using the H2 database to Postgres or MySQL/MariaDB. Execute the following commands to migrate Metabase backend database from H2 to RDS PostgreSQL that was provisioned before in the Step 1. Please update <rds_pg_url_metabase_database> with the corresponding connection string.

cd ~/opensource_with_apsaradb/luigi_metabase/metabase
export MB_DB_TYPE=postgres
export MB_DB_DBNAME=metabase
export MB_DB_PORT=1921
export MB_DB_USER=metabase
export MB_DB_PASS=N1cetest
export MB_DB_HOST=<rds_pg_url_metabase_database>
java -jar metabase.jar load-from-h2 ~/opensource_with_apsaradb/luigi_metabase/metabase/metabase.db

image.png

image.png

Then execute the command to start Metabase using the RDS PostgreSQL as the backend database. Please update <rds_pg_url_metabase_database> with the corresponding connection string.

export MB_DB_TYPE=postgres
export MB_DB_DBNAME=metabase
export MB_DB_PORT=1921
export MB_DB_USER=metabase
export MB_DB_PASS=N1cetest
export MB_DB_HOST=<rds_pg_url_metabase_database>
java -jar metabase.jar

image.png

image.png

Once it is up and running, navigate to http://<ECS_EIP>:3000/

I've preset the following accounts in demo Metabase, please logon with the Admin User.

image.png

image.png


Step 3. Setup the demo RDS PostgreSQL database

Please log on to ECS with ECS EIP in another CLI window (DO NOT close the CLI window logged in Step 2).

ssh root@<ECS_EIP>

Before we demo the ETL data pipeline, let's execute the following commands to create the schema sales_dw and tables (CREATE TABLE DDL are within the SQL file https://github.com/alibabacloud-howto/opensource_with_apsaradb/blob/main/luigi_metabase/sales_dw_ddl.sql) in the demo RDS PostgreSQL database. Please replace <rds_pg_url_demo_database> with the corresponding connection string of the demo RDS PostgreSQL instance. When prompting the password of connecting to the schema sales_dw, please input N1cetest, which is preset in the terraform provision script in Step 1. If you've already changed it, please update accordingly.

cd ~/adbpg_client_package/bin
./psql -h<rds_pg_url_demo_database> -p1921 -Udemo sales_dw

In the PG client, execute the DDL SQL file and check that 6 empty tables are created.

\i ~/opensource_with_apsaradb/luigi_metabase/sales_dw_ddl.sql
select tablename from pg_tables where schemaname='public';

image.png

There are 3 tables as the source tables and 3 tables as the target tables in the demo ETL data pipeline:

  • product_info: a source table in the demo ETL data pipeline
  • invoice: a source table in the demo ETL data pipeline
  • customer_info: a source table in the demo ETL data pipeline
  • invoice_time: a target table in the demo ETL data pipeline
  • invoice_outliers: a target table in the demo ETL data pipeline
  • association_rules: a target table in the demo ETL data pipeline

Step 4. Run the demo ETL data pipeline on Luigi

Please log on to ECS with ECS EIP in another new CLI window (DO NOT close the CLI window logged in Step 2 and Step 3).

ssh root@<ECS_EIP>

Within this CLI window, execute the command to start luigi daemon.

luigid

image.png

Once it is up and running, navigate to http://<ECS_EIP>:8082/

image.png

Now, we can run the ETL data pipeline in Luigi. The following image shows the ETL data pipeline workflow in the demo.

image.png

Basically, the full ETL data pipeline code are in https://github.com/alibabacloud-howto/opensource_with_apsaradb/blob/main/luigi_metabase/data_pipeline.py. It will load the raw data in the local ECS disk under https://github.com/alibabacloud-howto/opensource_with_apsaradb/tree/main/luigi_metabase/data, then process and transform the data to local disk, and finally load the data into the RDS PostgreSQL database for Metabase BI reporting. The BI reports in Metabase has already been composed in this demo within Metabase.

Switch to the CLI window created at Step 3. Before execution, please edit the pipeline python code to change the demo database connection string URL to the value of <rds_pg_url_demo_database>.

cd ~/opensource_with_apsaradb/luigi_metabase
vim data_pipeline.py

image.png

Then execute the following commands to kick off a pipeline execution for the data at 2018-03-30 in this CLI window.

cd ~/opensource_with_apsaradb/luigi_metabase
PYTHONPATH='.' luigi --module data_pipeline CompleteDataDumpLoad --date 2018-03-30

The data pipeline execution summary shows at the end.

image.png

Refresh the Luigi web page http://<ECS_EIP>:8082/, you can see the data pipeline execution information.

image.png

image.png


Step 5. View the data in demo RDS PostgreSQL and BI report on Metabase

In the CLI window created at Step 3, execute the following commands to verify the data processed in the data pipeline. Please replace <rds_pg_url_demo_database> with the corresponding connection string of the demo RDS PostgreSQL instance. When prompting the password of connecting to the schema sales_dw, please input N1cetest, which is preset in the terraform provision script in Step 1. If you've already changed it, please update accordingly.

cd ~/adbpg_client_package/bin
./psql -h<rds_pg_url_demo_database> -p1921 -Udemo sales_dw

In the PG client, execute the SQL to view the data.

select tablename from pg_tables where schemaname='public';
select count(*) from association_rules; 
select count(*) from product_info;
select count(*) from invoice;
select count(*) from customer_info;
select count(*) from invoice_time;
select count(*) from invoice_outliers;

image.png

Then navigate to Metabase database Admin setting to update the target database to the demo RDS PostgreSQL database <rds_pg_url_demo_database>.

image.png

image.png

image.png

image.png

image.png

image.png

This tutorial is modified based on https://github.com/abhishekzambre/data-warehouse to run on Alibaba Cloud. There are some errors in the original source code, I've already fixed them and made them all work on Alibaba Cloud.