Wk1 #48

Open
wants to merge 19 commits into
base: main
4 changes: 4 additions & 0 deletions greenery/.gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@

target/
dbt_packages/
logs/
15 changes: 15 additions & 0 deletions greenery/README.md
@@ -0,0 +1,15 @@
Welcome to your new dbt project!

### Using the starter project

Try running the following commands:
- dbt run
- dbt test


### Resources:
- Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
- Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
- Join the [chat](https://community.getdbt.com/) on Slack for live discussions and support
- Find [dbt events](https://events.getdbt.com) near you
- Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices
Empty file added greenery/analyses/.gitkeep
Empty file.
83 changes: 83 additions & 0 deletions greenery/corise_answers/wk1/README.md
@@ -0,0 +1,83 @@
USE SCHEMA dev_db.dbt_jmanahan;

-- Number of users?
-- We have 130 users in our system.
SELECT
COUNT(1) n
, COUNT(DISTINCT u.user_id) n_users
FROM stg_user u
;


-- Average number of orders per hour?
-- Across the time period we've been making sales, we have about 7.7 orders in an average hour
-- Theoretically includes zero-order hours, although none of those have happened yet
SELECT
ABS(TIMESTAMPDIFF(HOUR, MAX(o.order_created_at), MIN(o.order_created_at))) n_hours
, COUNT(1) n_orders
, n_orders / n_hours AS orders_per_hour
FROM stg_order o
;

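-- Every hour has 1+ orders
-- An hour with no orders has no rows to group, so compare hours seen against hours spanned
SELECT
COUNT(DISTINCT DATE_TRUNC(HOUR, o.order_created_at)) n_hours_with_orders
, TIMESTAMPDIFF(HOUR, MIN(o.order_created_at), MAX(o.order_created_at)) + 1 n_hours_spanned
, n_hours_spanned - n_hours_with_orders AS n_empty_hours
FROM stg_order o
;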


-- Average time placed-to-delivered of an order?
-- 3.89 days for landed orders, 3.81 days if including estimates for unlanded orders where available
SELECT
AVG(TIMESTAMPDIFF(SECOND, o.order_created_at, o.actual_delivered_at)) / 86400 AS days_to_deliver_actual
, AVG(
TIMESTAMPDIFF(
SECOND
, o.order_created_at
, NVL(o.actual_delivered_at, o.estimated_delivered_at)
)
) / 86400
AS days_to_deliver_best_guess
FROM stg_order o
;


-- How many users have 1, 2, 3+ orders?
-- 25 users have one order, 28 have 2, and 71 have more
WITH cte_order_count AS (
SELECT
o.user_id
, COUNT(1) n_orders
FROM stg_order o
GROUP BY 1
)
SELECT
CASE WHEN oc.n_orders >= 3 THEN '3+' ELSE oc.n_orders::VARCHAR END AS n_orders
, COUNT(1) n_users
FROM cte_order_count oc
GROUP BY 1
ORDER BY n_orders ASC
;


-- Average unique sessions per hour?
-- Across the time period we've been recording events, we have about 10.1 sessions in an average hour
-- Theoretically includes zero-session hours, although none of those have happened yet
SELECT
ABS(TIMESTAMPDIFF(HOUR, MAX(e.event_created_at), MIN(e.event_created_at))) n_hours
, COUNT(DISTINCT e.session_id) n_sessions
, n_sessions / n_hours AS sessions_per_hour
FROM stg_event e
;

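-- Every hour has 1+ sessions
-- An hour with no events has no rows to group, so compare hours seen against hours spanned
SELECT
COUNT(DISTINCT DATE_TRUNC(HOUR, e.event_created_at)) n_hours_with_sessions
, TIMESTAMPDIFF(HOUR, MIN(e.event_created_at), MAX(e.event_created_at)) + 1 n_hours_spanned
, n_hours_spanned - n_hours_with_sessions AS n_empty_hours
FROM stg_event e
;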
64 changes: 64 additions & 0 deletions greenery/corise_answers/wk2/README.md
@@ -0,0 +1,64 @@
### What is the user repeat rate (users w/ 2+ purchases / users w/ 1+ purchase)
80%

```SQL
SELECT
COUNT(IFF(NVL(um.user_order_count, 0) > 1, 1, NULL))
/ COUNT(IFF(NVL(um.user_order_count, 0) > 0, 1, NULL))
AS repeat_rate
FROM tbl_user_metrics um
```
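
As a cross-check, the same ratio can be sketched directly from `stg_order` without the mart table. This assumes `tbl_user_metrics.user_order_count` is simply the number of `stg_order` rows per `user_id`, which is not confirmed here. If the week 1 counts (25, 28 and 71 users) still hold, it works out to 99 / 124, or roughly 80%.

```SQL
-- Cross-check sketch: repeat rate straight from the staging table
-- Assumes one row per order in stg_order (same assumption as the week 1 queries)
WITH cte_order_count AS (
    SELECT
        o.user_id
        , COUNT(1) n_orders
    FROM stg_order o
    GROUP BY 1
)
SELECT
    COUNT_IF(oc.n_orders > 1) / COUNT(1) AS repeat_rate
FROM cte_order_count oc
;
```

Every user in the CTE has at least one order, so `COUNT(1)` is already the "1+ purchase" denominator.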


### What are good indicators of a user who will likely purchase again
As this is a hypothetical, I'm going to guess without double checking
- Likely repeat purchasers may show higher initial order values, faster delivery times, or an add-to-cart event
- Users unlikely to repeat may be those with few URL visits or those who used deep-discount promo codes
- We don't have data around how we acquired the user (ads probably convert at a lower rate)
- We don't have data around user profile, like income level that may correlate with repeat rate


### See file structure changes and models within them for dim/fact and intermediate modeling
I added mart-level data for all the questions I expected stakeholders to ask.
I'm feeling a bit exposed on time-trend questions, but confident the data is there to answer them (just not super intuitively)
Naming conventions followed my personal preference rather than dim/fact because I've found `tbl_xx_metrics` to be better understood by less well trained coworkers
One metrics table was made at every granularity I aggregated to. There is some duplicative descriptive data where it is probably useful (ex. `street_address` in both `tbl_shipping_metrics` and `tbl_user_metrics`)
Unless a model is obviously of interest only to marketing or only to the product team, it lives in core. No strong preference here


### See png file in same folder for DAG
It doesn't look great, but I only see one line that isn't necessary. Open to advice.


### Added a bunch of test instances. Here's some reasoning
- Numbers should be positive unless they come from a DATEDIFF
- Primary keys should be unique and not null, and sometimes must reference the stg table PK too
- Foreign keys (stg_order_item) must reference the table to which they are FKs
- Margin percentage and a few others should not go above one
- And more

Although I wasn't ambitious or assumptive enough to find bad data through the tests used, I did learn that materializing as ephemeral does not play nicely with tests
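
The foreign-key rule above can also be written as a plain query that returns the offending rows, which is the shape dbt expects from a singular test. This is a minimal sketch, assuming `stg_order_item` carries an `order_id` column that should always exist in `stg_order`; the column name is an assumption, and inside a dbt project the table names would be wrapped in `ref()`.

```SQL
-- Any row returned is a failure: an order item pointing at an order that does not exist
-- Assumes stg_order_item.order_id is the foreign key to stg_order.order_id
SELECT oi.*
FROM stg_order_item oi
LEFT JOIN stg_order o
    ON oi.order_id = o.order_id
WHERE o.order_id IS NULL
;
```

An empty result means every order item references a known order.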


### Ensure tests are passing regularly
This depends on the orchestration tools I have available. At my current job, I would exclusively use `dbt build` so that each model's tests run immediately after that model updates. I would also schedule this, perhaps hourly.


### Which orders changed from a snapshot
Three orders changed from prepared to shipped status

```SQL
WITH cte_order_changes AS (
SELECT DISTINCT order_id
FROM snapshot_orders so
WHERE dbt_valid_to IS NOT NULL
)

SELECT so.*
FROM snapshot_orders so
JOIN cte_order_changes oc
ON so.order_id = oc.order_id
ORDER BY
so.order_id
, so.dbt_valid_from DESC
```
85 changes: 85 additions & 0 deletions greenery/corise_answers/wk3/README.md
@@ -0,0 +1,85 @@
### What is our overall conversion rate?

62%
```SQL
USE DATABASE dev_db;
USE SCHEMA dbt_jmanahan;

SELECT ROUND(SUM(sm.checkout_count) / COUNT(1), 4) overall_conversion_rate
FROM tbl_session_metrics sm
;

SELECT *
FROM tbl_product_metrics pm
;
```


### What is our conversion rate by product?

Between 34% and 61%
```SQL
SELECT
pm.product_id
, pm.product_name
, ROUND(
pm.product_order_count / pm.product_session_count
, 4
) AS product_conversion_rate
FROM tbl_product_metrics pm
;
```


### Why the difference
SKIP. I'm here to learn the tools and strategies of the trade, not pretend I'm a data analytics stakeholder


### Make a macro
Macro `order_revenue` is used in `tbl_product_metrics` and `tbl_order_ledger`.
The note in the course materials says "start here" for a specific formula.
That formula doesn't make much sense in this representation of the data,
because event types per session are aggregated once in the transformation.
I have instead followed the conflicting instruction to think about what would improve the
usability/modularity of the code.
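
For reference, a call to the macro inside a model might look like the sketch below. Only the macro name and its three arguments come from this project. The `stg_product` model and the `price`, `quantity` and `product_id` columns are assumptions made for illustration.

```SQL
-- Hypothetical model snippet; stg_product and the column names are assumptions
SELECT
    oi.order_id
    , oi.product_id
    -- discount_col is omitted, so the macro falls back to its default of 0
    , {{ order_revenue('p.price', 'oi.quantity') }} AS item_revenue
FROM {{ ref('stg_order_item') }} oi
JOIN {{ ref('stg_product') }} p
    ON oi.product_id = p.product_id
```

With the default discount, the macro call compiles to `p.price * oi.quantity * (1 - 0)`, so the formula only has to change in one place.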


### Hooks
See the practice hook added to `stg_address.sql`
I couldn't think of a useful hook, although the GRANT SELECT idea originally in the
instructions could have been good (especially if the config option `grants` didn't exist).
When and how to use hooks was something I was hoping to get out of this class,
and it's disappointing to still not have any practical examples of when it's
worthwhile to add something as a post hook
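
For what it's worth, the GRANT SELECT idea would look roughly like the sketch below when set in a model's config block. The role name `reporting` is hypothetical, and the `SELECT` is a placeholder for the model's real body.

```SQL
-- Sketch of a post hook; the role name "reporting" is a made-up example
{{
    config(
        post_hook = "GRANT SELECT ON {{ this }} TO ROLE reporting"
    )
}}

-- Placeholder body standing in for the real model query
SELECT *
FROM {{ ref('stg_address') }}
```

dbt runs the hook statement after the model is built, with `{{ this }}` resolving to the relation that was just created.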


### Packages
Added a package and used one of its macros.
Similarly felt like more direction would have created a better learning environment
(perhaps a goal that's hard to code, but solved by a certain package)
Reading through package contents is a good skill
but in real life is more of a hunt for a specific functionality


### DAG changes
Skipped. There were no DAG changes with the changes made here.
This too seems like a place where more could be learned with greater structure.


### Snapshots
Three orders gained a tracking ID, a shipping service, and an estimated delivery time, and changed status to shipped
```SQL
SELECT
*
FROM snapshot_orders
WHERE order_id IN (
SELECT order_id
FROM snapshot_orders
WHERE dbt_valid_to > '2022-10-17'
)
ORDER BY order_id, dbt_valid_to DESC
```

Heads up, this is saved in the course git repository as `delivery_update.sql` and `delivery_update.sh`
Would appreciate a little more professionalism

34 changes: 34 additions & 0 deletions greenery/dbt_project.yml
@@ -0,0 +1,34 @@

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'greenery'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'greenery'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
- "target"
- "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

models:
  staging:
    postgres:
      +materialized: table
Empty file added greenery/macros/.gitkeep
Empty file.
14 changes: 14 additions & 0 deletions greenery/macros/_macro.yml
@@ -0,0 +1,14 @@
version: 2

macros:
  - name: order_revenue
    description: >
      price * quantity * (1 - discount)
      Abstracted out from models to have a single place to make changes to the formula if needed
    arguments:
      - name: price_col
        description: A numeric column name representing the price in the formula
      - name: quantity_col
        description: A numeric column name representing the quantity in the formula
      - name: discount_col
        description: A numeric column between zero and one representing the percent of revenue not collected
5 changes: 5 additions & 0 deletions greenery/macros/no_greater_than_one.sql
@@ -0,0 +1,5 @@
{% test no_greater_than_one(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} > 1
{% endtest %}
5 changes: 5 additions & 0 deletions greenery/macros/not_negative.sql
@@ -0,0 +1,5 @@
{% test not_negative(model, column_name) %}
-- Returns failing rows: zero is allowed, anything below zero is not
SELECT *
FROM {{ model }}
WHERE {{ column_name }} < 0
{% endtest %}
3 changes: 3 additions & 0 deletions greenery/macros/order_revenue.sql
@@ -0,0 +1,3 @@
{%- macro order_revenue(price_col, quantity_col, discount_col = 0) %}
{{ price_col }} * {{ quantity_col }} * (1 - {{ discount_col }})
{% endmacro -%}
5 changes: 5 additions & 0 deletions greenery/macros/positive_values.sql
@@ -0,0 +1,5 @@
{% test positive_values(model, column_name) %}
SELECT *
FROM {{ model }}
WHERE {{ column_name }} <= 0
{% endtest %}