-
Create Python virtual environment and install all dependencies:
python -m venv env
source env/bin/activate
pip install -r requirements.txt
-
Initialize PostgreSQL container:
docker-compose up
-
Run and test dbt models on PostgreSQL database:
./run.sh
The company currently has end-customers referred to us by partners, such as lettings agents and mortgage brokers. The customer then signs up for our service, and we pay the partner a small commission per referrals. Referrals are on a company level: a customer who signs up for five companies counts as five referrals. Five customers in one company count as one referral.
Partners each have consultants, such as Joe Smith working at Lettings Agent A. The referrals are attributed to the specific consultant at a partner.
The data tables provided are as follows:
partners
id
created_at
updated_at
partner_type
lead_sales_contact
sales people
name
country
referrals
id
created_at
updated_at
company_id
partner_id
consultant_id
status
is_outbound
For referrals, the updated_at
field essentially says when the status went from pending
to either
disinterested
or successful
. Timestamps are in Unix Nano format.
is_outbound
is true
when we refer a customer to a partner, i.e. "upsell". In this case we send
them the customer, and they pay us a commission. We haven't done this very thoroughly yet, so most referrals are inbound.
Our sales people work in a "key account" model. Referrals come from partners, and a sales person typically manages partner accounts.
- Analysis of data sources
- Implementation of staging models for data sources
- Definition of business questions/metrics
- Building of data models
- Improvements
- Limitations
- Suggestions
The first step is the analysis and profiling of sources. The analysis is in analyses](analyses) folder. I chose Jupiter Notebook because is the most simple tool to show an interactive data analysis.
I started to get the shape and column data types of the dataset, and after that, the data profiling. For that, I used Pandas Profiling because is a great tool to get basic reports for exploratory data analysis.
The Sales People source contains 2 columns and 12 rows.
The columns are:
name
is a categorical column that contains information about the name of the sales partner;country
is a categorical column that contains information about the country to which each sales partner belongs to. There are no missing values, and all the observations represent distinct values, meaning that does not exist duplicate rows.
The sales partners' names are:
- Orange
- Apple
- Lion
- Tree
- Root
- Sky
- Cloud
- Horiz
- Leaf
- Daisy
- Tulip
- Fig
The sales partners belong to the following countries:
- Singapore
- HongKong
- UK
In terms of frequency, 50% of the partners are based in the UK, 33.33% in Hong Kong and 16.7% in Singapore.
More details about this source can be found here.
Based on that analysis, we can define a set of data quality validations for the sales_people
source:
name
column should be unique, not null and have the following values: Orange, Apple, Lion, Tree, Root, Sky, Cloud, Horiz, Leaf, Daisy, Tulip and Figcountry
column should be not null and have the following values: Singapore, HongKong and UK
The Partners source contains 5 columns and 522 rows.
The columns are:
id
: unique identifier of a partner - primary key;created_at
: datetime in Unix nano format, which the partner was created on the system;updated_at
: datetime in Unix nano format, which the partner was updated on the system;partner_type
: categorical column, which represents the type of the partner;lead_sales_contact
: categorical column, which represents the name of the sales partnership.
There are no missing values, and each observation represents a singular partner, meaning that, there are 522 partners distributed across different sales partnerships.
The types of partner types are the following:
- Agent
- IFA
- Developer
- Other
- Influencer
- Management company
- Lender
- Insurer
In terms of frequency, 67.2% of the partners are agents, 10.3% are IFA, and 5.9% are developers. The rest of 16.6% belongs to Other (4.8%), Influencer (4.4%), Management Company (3.4%), Lender (3.3%) and Insurer (0.6%).
The lead sales contact are the following:
- Sky
- Root
- Lion
- Tree
- Daisy
- Fig
- Horiz
- 0
- Leaf
- Cloud
- Apple
- Tulip
- Potato
Note that there is a lead sales contact that is not mapped on the sales_people table - '0'.
In terms of frequency, 23.0% of partners belong to Root lead sales contact, followed by Potato which represents 20.9%; Lion and Daisy with 8.8%; Apple with 8.2%; Sky with 7.1%; Tree with 6.9%; Leaf with 6.3%; Cloud with 3.8%; Horiz with 3.4%, and the remaining lead sales contact with 2.7%.
More details about this source can be found here.
Based on that analysis, we can define a set of data quality validations for the partners
source:
id
column should be unique and not null;created_at
column should be not null;updated_at
column should be not null;partner_type
column should be not null, and have the following values: Agent, IFA, Developer, Other, Influencer, Management company, Lender and Insurer;lead_sales_partner
column should be not null and have the following values: Orange, Apple, Lion, Tree, Root, Sky, Cloud, Horiz, 0, Leaf, Daisy, Tulip and Fig.
The Referrals source contains 8 columns and 1470 rows.
The columns are:
id
: unique identifier of a referral - primary key;created_at
: datetime in Unix nano format, in which the referral was created on the system;updated_at
: datetime in Unix nano format, in which the referral changed the status on the system;company_id
: unique identifier of a company - foreign key;partner_id
: unique identifier of a partner - foreign key;consultant_id
: unique identifier of a consultant - foreign key;status
: categorial column that contains the referral's status;is_outbound
: boolean column; istrue
when the company refers a customer to a partner (upsell). In this case, the partner pays a commission to the company.
There are no missing values, and each observation is a new referral made by a partner, that includes details about the partner, consultant, status of the referral, etc.
Some findings found:
- The partners made referrals for 776 distinct companies;
- The referrals were made by 110 partners;
- There are 190 consultants across the partners;
- 59.7% of referrals were successful;
- 36.4% of referrals are pending;
- 3.9% of the customers classified the referral as disinterested;
- 63.4% of the referrals were made by partners;
- 36.6% of the referrals were made by the company to a partner (upsell).
More details about this source can be found here.
Based on that analysis, we can define a set of data quality validations for the referrals
source:
id
column should be unique and not null;created_at
column should be not null;updated_at
column should be not null;company_id
column should be not null;partner_id
column should be not null;consultant_id
column should be not null;status
column should be not null and have the following values: successful, pending and disinterested;is_outbound
column should be not null and have the following values: 0 and 1
Based on that analysis, we can define a set of SQL transformations on the staging model:
- Add unique identifier
- Convert UK to United Kingdom
- Convert HongKong in Hong Kong
- Rename
name
tolead_sales_name
- Add
ingested_at
column that represents the timestamp at which the row was ingested by dbt
Based on that analysis, we can define a set of SQL transformations on the staging model:
- Convert
created_at
column to human readable DateTime; - Convert
updated_at
column to human readable DateTime; - Convert IFA to Independent Financial Adviser;
- Rename
lead_sales_contact
column tolead_sales_name
; - Add
ingested_at
column that represents the timestamp at which the row was ingested by dbt.
Based on that analysis, we can define a set of SQL transformations on the staging model:
- Convert
created_at
column to human readable DateTime; - Convert
updated_at
column to human readable DateTime; - Convert
is_outbound
column to boolean logic; - Add
ingested_at
column that represents the timestamp at which the row was ingested by dbt.
Based on the operational data, I defined referrals' performance as my analysis. For that, I used referral status as the key to my analysis considering just "non-upsell" referrals.
The reason why chose that is because we can split referrals into three different statuses: succeded, pending and disinterested. From each status, we can get information at the level of partner, consultant, company, sales country, and time.
On top of those, we can build interesting reports/dashboards, such as:
- How many succeeded referrals were made by partner
- How many succeeded referrals were made by consultant
- How many disinterested referrals were made by partner
- How many disinterested referrals were made by consultant
- How many pending referrals do we have grouped by partner
- How many pending referrals do we have grouped by consultant
- What are the partner's countries with more succeeded referrals
- What are the partner's countries with more disinterested referrals
- And so on.
These metrics will allow the company to monitor the performance of referrals in a detailed grain.
The following image shows the DAG obtained:
stg_partners
,stg_sales_people
andstg_referrals
are the staging models. Each model bears a one-to-one relationship with the source data table it represents. It has the same granularity, but the columns have been renamed or recast.dim_partners
andfct_referrals
are the marts models. Thedim_partners
representspartners
business entity, and stores attribute about that. Thefct_referrals
represents the business process of referrals.referrals_disinterested
,referrals_succeeded
andreferrals_pending
are the reporting models, where each one represents a referral's status. These models already aggregate referrals bypartner_id
,consultant_id
andcompany_id
, giving the total of referrals based on the status. I decided to add this additional layer to become the model more related to the real business process and to facilitate the generation of reports on a dashboard tool (avoiding complex data aggregations on the dashboard side).
The following image shows the database schemas:
raw
: contains tables with raw data;staging
: contains staging models;marts
: contains dim and fact models;reporting
: contains BI models;
Based on the business metrics mentioned above, the company could get useful insights from Referrals models, such as:
- Would be easier to define goals with partners;
- Find the root cause of disinterested referrals and discuss with the partners how to improve it;
- Identify possible bottlenecks on referrals that are delaying the customer decision;
- And so on.