
ETL Building for an E-commerce Jeans Company

A Data Engineering Project

Obs 1: The company and business problem are both fictitious, although the data is real.

Obs 2: Scraping the H&M website is allowed according to H&M's robots.txt file.

The in-depth Python code explanation is available in this Jupyter Notebook.

1. Star Jeans and Business Problem

Michael, Franklin and Trevor, after several successful businesses, are starting a new company called Star Jeans. For now, their plan is to enter the USA fashion market through an e-commerce. The initial idea is to sell a single product to a specific audience: male jeans. Their goal is to keep prices low and slowly increase them as they gain new clients. However, this market already has strong competitors, such as H&M. In addition, the three businessmen aren't familiar with this particular segment. Therefore, in order to better understand how this market works, they hired a Data Science/Engineering freelancer to gather information about H&M. They want to know the following about H&M male jeans:

  • Product Name
  • Product Type
  • Product Fit
  • Product Color
  • Product Composition
  • Product Price

2. Solution Plan

2.1. How was the problem solved?

We managed to gather the information on H&M male jeans by creating an ETL, which consists of the following steps (jobs):

  • Understanding the Business Problem: Understanding the main objective we are trying to achieve and planning the solution to it.

  • Extraction: Scraping product_id and product_type from the showroom page (Job 01); getting the remaining attributes from each product page and saving it all in a Pandas DataFrame (Job 02). More information in Section 3.

  • Transformation: Data Cleaning (Job 03). More information in Section 4.

  • Loading: Inserting the data into a PostgreSQL Database (Job 04). More information in Section 5.

  • Streamlit App: Loading the Database in the Streamlit App (Job 05); displaying the data and adding filters in the Streamlit App (Job 06). More information in Section 6.

Here you can find the full ETL documentation, and below there's an illustration showing the complete ETL process and its flow:

All jobs are performed sequentially. Jobs 01-04 are run by this script, while the Streamlit App (Jobs 05 and 06) is built by this script. Jobs 01-04 are scheduled to run on a weekly basis via Windows Task Scheduler, which means the Streamlit App (Jobs 05 and 06) also updates at the same frequency, since Job 05 loads the Database into Streamlit after it's been processed by the ETL.
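A minimal sketch of how such a sequential run could look in a single Python script (the job function names and the etl_jobs module are illustrative, not the repository's actual code):

```python
# A minimal sketch of the sequential run, assuming hypothetical job functions
# kept in an illustrative etl_jobs module (not the repository's actual code).
import logging

from etl_jobs import (  # hypothetical module grouping the four jobs
    job_01_extract_showroom,
    job_02_extract_details,
    job_03_clean,
    job_04_load,
)

logging.basicConfig(level=logging.INFO)


def run_etl() -> None:
    """Run Jobs 01-04 in order; Windows Task Scheduler calls this script weekly."""
    showroom_df = job_01_extract_showroom()       # Job 01: ids and types from the showroom
    raw_df = job_02_extract_details(showroom_df)  # Job 02: per-product attributes
    clean_df = job_03_clean(raw_df)               # Job 03: data cleaning
    job_04_load(clean_df)                         # Job 04: append to PostgreSQL
    logging.info("ETL finished: %d products processed", len(clean_df))


if __name__ == "__main__":
    run_etl()
```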

2.2. Tools and techniques used:

3. Extraction

The extraction is made by scraping the H&M male jeans webpage, using Python and the Beautiful Soup library. This process is divided into two jobs:

  • Job 01: Gathering the web page link and product id for each product in the showroom, as well as the product type, since this information isn't available on each individual product page. Then, the product id is split into a style id (first 7 digits) and a color id (last 3 digits) for later merging.

  • Job 02: Getting the remaining attributes from each product and saving it all in a Pandas DataFrame. These attributes are product name, fit, color, composition and price. In addition, a scraping_datetime variable is added every time this process is executed, recording when the scraping was done.

Job 02 is by far the most time-consuming of all jobs in terms of script runtime.
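A minimal sketch of Jobs 01 and 02, assuming requests and Beautiful Soup; the URLs, CSS class names and data attributes shown are illustrative assumptions and would need to match the actual H&M page structure:

```python
# A minimal sketch of Jobs 01 and 02 using requests and Beautiful Soup.
# URLs, CSS class names and data attributes below are illustrative assumptions
# and must be checked against the real H&M page structure.
import requests
import pandas as pd
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}
SHOWROOM_URL = "https://www2.hm.com/en_us/men/products/jeans.html"  # example URL


def job_01_extract_showroom() -> pd.DataFrame:
    """Job 01: collect product_id and product_type from the showroom page."""
    soup = BeautifulSoup(requests.get(SHOWROOM_URL, headers=HEADERS).text, "html.parser")
    items = soup.find_all("article", class_="hm-product-item")  # assumed class name
    df = pd.DataFrame(
        [(i.get("data-articlecode"), i.get("data-category")) for i in items],
        columns=["product_id", "product_type"],
    )
    df["style_id"] = df["product_id"].str[:7]   # first 7 digits
    df["color_id"] = df["product_id"].str[-3:]  # last 3 digits
    return df


def job_02_extract_details(showroom: pd.DataFrame) -> pd.DataFrame:
    """Job 02: visit each product page and collect name, fit, color, composition and price."""
    rows = []
    for product_id in showroom["product_id"]:
        url = f"https://www2.hm.com/en_us/productpage.{product_id}.html"  # example URL pattern
        soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
        rows.append({
            "product_id": product_id,
            "product_name": soup.find("h1").get_text(strip=True),
            # fit, color, composition and price would be parsed from the details markup here
        })
    details = pd.DataFrame(rows)
    details["scraping_datetime"] = pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")
    return showroom.merge(details, on="product_id", how="left")
```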

4. Transformation

After the full raw table is available it needs to be cleaned, which is Job 03. Firstly, all column names are converted to snake case, as are all values in the product_color, product_fit, product_name and product_price columns.

The most difficult column to fix is product_composition, since it's split into another six columns: cotton, polyester, spandex, elastomultiester, lyocell and rayon. Each one of these columns indicates how much, in percentage terms, that fiber contributes to the product's composition. Finally, duplicated rows are dropped and the columns are rearranged. The final table definition is as follows:

| Column | Definition |
| --- | --- |
| product_id | A 10-digit number uniquely assigned to each product, composed of style_id and color_id |
| style_id | A 7-digit number uniquely assigned to each product style |
| color_id | A 3-digit number assigned to each product color |
| product_name | Product's name |
| product_type | Product's type |
| product_color | Product's color |
| product_fit | Product's fit - slim, skinny, loose, etc. |
| cotton | Percentage of cotton in the product's composition |
| spandex | Percentage of spandex in the product's composition |
| polyester | Percentage of polyester in the product's composition |
| elastomultiester | Percentage of elastomultiester in the product's composition |
| lyocell | Percentage of lyocell in the product's composition |
| rayon | Percentage of rayon in the product's composition |
| product_price | Product's unit price |
| scraping_datetime | Date and time at which the data scraping was performed |
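A minimal sketch of the Job 03 cleaning steps, assuming the raw Pandas DataFrame stores the composition as a free-text string such as "Cotton 98%, Spandex 2%" (the helper names are illustrative):

```python
# A minimal sketch of the Job 03 cleaning, assuming the raw DataFrame stores the
# composition as free text such as "Cotton 98%, Spandex 2%" (helper names are illustrative).
import re
import pandas as pd

FIBERS = ["cotton", "polyester", "spandex", "elastomultiester", "lyocell", "rayon"]


def to_snake_case(text: str) -> str:
    return re.sub(r"\s+", "_", str(text).strip().lower())


def job_03_clean(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    df.columns = [to_snake_case(c) for c in df.columns]
    for col in ["product_color", "product_fit", "product_name"]:
        df[col] = df[col].apply(to_snake_case)
    # Keep only digits and the decimal point in the price, then cast to float
    df["product_price"] = (
        df["product_price"].astype(str).str.replace(r"[^\d.]", "", regex=True).astype(float)
    )
    # Split the composition string into one percentage column per fiber
    for fiber in FIBERS:
        df[fiber] = (
            df["product_composition"]
            .str.extract(rf"{fiber}\s*(\d+)%", flags=re.IGNORECASE)[0]
            .fillna(0)
            .astype(int)
        )
    return df.drop(columns=["product_composition"]).drop_duplicates()
```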

5. Loading

After the data is cleaned, the script inserts it into a PostgreSQL Database using Python's SQLAlchemy library. For this project, a free Database from Neon.tech is being used. This whole process is Job 04.

It's important to note that new data is always appended to the database, never replaced, making it possible to spot differences in prices for the same product over time.
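A minimal sketch of Job 04, assuming the Neon.tech connection string is kept in an environment variable; the variable and table names are illustrative:

```python
# A minimal sketch of Job 04, assuming SQLAlchemy and a Neon.tech connection string
# in an environment variable; the variable and table names are illustrative.
import os
import pandas as pd
from sqlalchemy import create_engine


def job_04_load(clean: pd.DataFrame, table_name: str = "hm_male_jeans") -> None:
    """Append the cleaned data to the PostgreSQL database."""
    engine = create_engine(os.environ["NEON_POSTGRES_URL"])  # e.g. postgresql+psycopg2://user:pass@host/db
    # if_exists="append" keeps the historical rows, so price changes stay visible over time
    clean.to_sql(table_name, con=engine, if_exists="append", index=False)
```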

6. Streamlit App

Streamlit was chosen to display the data since it makes it easy to create interactive tools, such as filters. In addition, deployment can be done directly through Streamlit Cloud itself, not requiring another cloud provider. Once the data is added to the PostgreSQL database, the Streamlit App has two jobs, sketched below:

  • Job 05: Loading the data from the PostgreSQL Database into Streamlit.

  • Job 06: Displaying the data in a table and adding interactive filters to it.
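A minimal sketch of both jobs, reusing the illustrative connection string and table name from Section 5:

```python
# A minimal sketch of Jobs 05 and 06, reusing the illustrative connection string
# and table name from Section 5.
import os
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine


@st.cache_data(ttl=60 * 60 * 24)  # re-read the database at most once a day
def load_data() -> pd.DataFrame:
    """Job 05: load the table from PostgreSQL into Streamlit."""
    engine = create_engine(os.environ["NEON_POSTGRES_URL"])
    return pd.read_sql("SELECT * FROM hm_male_jeans", con=engine)


# Job 06: display the data with interactive filters
df = load_data()
st.title("Star Jeans - H&M Male Jeans")
fits = st.sidebar.multiselect("Product fit", sorted(df["product_fit"].unique()))
colors = st.sidebar.multiselect("Product color", sorted(df["product_color"].unique()))
if fits:
    df = df[df["product_fit"].isin(fits)]
if colors:
    df = df[df["product_color"].isin(colors)]
st.dataframe(df)
```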

Click below to access the App
Streamlit App

7. Conclusion

In this project the main objective was accomplished:

We managed to create an ETL process that extracts data from H&M, a Star Jeans competitor, cleans it, and saves it to a PostgreSQL database on a weekly basis. The database is then loaded and displayed with filters in a Streamlit App, which can be accessed from anywhere by Star Jeans' owners, so they can have a better understanding of how the USA male jeans market works.

8. Next Steps

Going forward, this solution could be improved by using Apache Airflow instead of Windows Task Scheduler to automate the ETL process, since with Windows Task Scheduler the computer must be on for the script to run.

Contact