
ETL Building for an E-commerce Jeans Company

A Data Engineering Project

Obs 1: The company and business problem are both fictitious, although the data is real.

Obs 2: Scraping the H&M website is allowed according to H&M's robots.txt file.

The in-depth Python code explanation is available in this Jupyter Notebook.

1. Star Jeans and Business Problem

Michael, Franklin and Trevor, after several successful businesses, are starting a new company called Star Jeans. For now, their plan is to enter the USA fashion market through an e-commerce. The initial idea is to sell a single product to a specific audience: male jeans. Their goal is to keep prices low and slowly increase them as they gain new clients. However, this market already has strong competitors, such as H&M. In addition, the three businessmen aren't familiar with this particular segment. Therefore, in order to better understand how this market works, they hired a Data Science/Engineering freelancer to gather information about H&M. They want to know the following about H&M male jeans:

  • Product Name
  • Product Type
  • Product Fit
  • Product Color
  • Product Composition
  • Product Price

2. Solution Plan

2.1. How was the problem solved?

We managed to gather the information on H&M male jeans by creating an ETL, which consists of the following steps (jobs):

  • Understanding the Business Problem: Understanding the main objective we are trying to achieve and planning the solution to it.

  • Extraction: Scraping product_id and product_type from the showroom page (Job 01); getting the remaining attributes from each product page and saving it all in a Pandas DataFrame (Job 02). More information in Section 3.

  • Transformation: Data Cleaning (Job 03). More information in Section 4.

  • Loading: Inserting the data into a PostgreSQL Database (Job 04). More information in Section 5.

  • Streamlit App: Loading the Database in the Streamlit App (Job 05); displaying the data and adding filters in the Streamlit App (Job 06). More information in Section 6.

Here you can find the full ETL documentation, and below there's an illustration showing the complete ETL process and its flow:

All jobs are performed sequentially. Jobs 01-04 are run by this script, while the Streamlit App (Jobs 05 and 06) is built by this script. Jobs 01-04 are scheduled to run on a weekly basis via Windows Task Scheduler, which means the Streamlit App (Jobs 05 and 06) also updates at the same frequency, since Job 05 loads the Database into Streamlit after it's been processed by the ETL.
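A minimal sketch of how such a sequential run could look in a single Python script (the job function names and the etl_jobs module are illustrative, not the repository's actual code):

```python
# A minimal sketch of the sequential run, assuming hypothetical job functions
# kept in an illustrative etl_jobs module (not the repository's actual code).
import logging

from etl_jobs import (  # hypothetical module grouping the four jobs
    job_01_extract_showroom,
    job_02_extract_details,
    job_03_clean,
    job_04_load,
)

logging.basicConfig(level=logging.INFO)


def run_etl() -> None:
    """Run Jobs 01-04 in order; Windows Task Scheduler calls this script weekly."""
    showroom_df = job_01_extract_showroom()       # Job 01: ids and types from the showroom
    raw_df = job_02_extract_details(showroom_df)  # Job 02: per-product attributes
    clean_df = job_03_clean(raw_df)               # Job 03: data cleaning
    job_04_load(clean_df)                         # Job 04: append to PostgreSQL
    logging.info("ETL finished: %d products processed", len(clean_df))


if __name__ == "__main__":
    run_etl()
```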

2.2. Tools and techniques used:

3. Extraction

The extraction is made by scraping the H&M male jeans webpage, using Python and the Beautiful Soup library. This process is divided into two jobs:

  • Job 01: Gathering the web page link and product id for each product in the showroom, as well as the product type, since this information isn't available on each individual product page. Then, the product id is split into a style id (first 7 digits) and a color id (last 3 digits) for later merging.

  • Job 02: Getting the remaining attributes from each product and saving it all in a Pandas DataFrame. These attributes are product name, fit, color, composition and price. In addition, a scraping_datetime variable is added every time this process is executed, recording when the scraping was done.

Job 02 is by far the most time-consuming of all jobs in terms of script runtime.
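A minimal sketch of Jobs 01 and 02, assuming requests and Beautiful Soup; the URLs, CSS class names and data attributes shown are illustrative assumptions and would need to match the actual H&M page structure:

```python
# A minimal sketch of Jobs 01 and 02 using requests and Beautiful Soup.
# URLs, CSS class names and data attributes below are illustrative assumptions
# and must be checked against the real H&M page structure.
import requests
import pandas as pd
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}
SHOWROOM_URL = "https://www2.hm.com/en_us/men/products/jeans.html"  # example URL


def job_01_extract_showroom() -> pd.DataFrame:
    """Job 01: collect product_id and product_type from the showroom page."""
    soup = BeautifulSoup(requests.get(SHOWROOM_URL, headers=HEADERS).text, "html.parser")
    items = soup.find_all("article", class_="hm-product-item")  # assumed class name
    df = pd.DataFrame(
        [(i.get("data-articlecode"), i.get("data-category")) for i in items],
        columns=["product_id", "product_type"],
    )
    df["style_id"] = df["product_id"].str[:7]   # first 7 digits
    df["color_id"] = df["product_id"].str[-3:]  # last 3 digits
    return df


def job_02_extract_details(showroom: pd.DataFrame) -> pd.DataFrame:
    """Job 02: visit each product page and collect name, fit, color, composition and price."""
    rows = []
    for product_id in showroom["product_id"]:
        url = f"https://www2.hm.com/en_us/productpage.{product_id}.html"  # example URL pattern
        soup = BeautifulSoup(requests.get(url, headers=HEADERS).text, "html.parser")
        rows.append({
            "product_id": product_id,
            "product_name": soup.find("h1").get_text(strip=True),
            # fit, color, composition and price would be parsed from the details markup here
        })
    details = pd.DataFrame(rows)
    details["scraping_datetime"] = pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")
    return showroom.merge(details, on="product_id", how="left")
```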

4. Transformation

After the full raw table is available it needs to be cleaned, which is Job 03. Firstly, all column names are converted to snake case, as are all values in the product_color, product_fit, product_name and product_price columns.

The most difficult column to fix is product_composition, since it's split into another six columns: cotton, polyester, spandex, elastomultiester, lyocell and rayon. Each one of these columns indicates how much, in percentage terms, that fiber contributes to the product's composition. Finally, duplicated rows are dropped and the columns are rearranged. The final table definition is as follows:

| Column | Definition |
| --- | --- |
| product_id | A 10-digit number uniquely assigned to each product, composed of style_id and color_id |
| style_id | A 7-digit number uniquely assigned to each product style |
| color_id | A 3-digit number assigned to each product color |
| product_name | Product's name |
| product_type | Product's type |
| product_color | Product's color |
| product_fit | Product's fit - slim, skinny, loose, etc. |
| cotton | Percentage of cotton in the product's composition |
| spandex | Percentage of spandex in the product's composition |
| polyester | Percentage of polyester in the product's composition |
| elastomultiester | Percentage of elastomultiester in the product's composition |
| lyocell | Percentage of lyocell in the product's composition |
| rayon | Percentage of rayon in the product's composition |
| product_price | Product's unit price |
| scraping_datetime | Date and time at which the data scraping was performed |
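A minimal sketch of the Job 03 cleaning steps, assuming the raw Pandas DataFrame stores the composition as a free-text string such as "Cotton 98%, Spandex 2%" (the helper names are illustrative):

```python
# A minimal sketch of the Job 03 cleaning, assuming the raw DataFrame stores the
# composition as free text such as "Cotton 98%, Spandex 2%" (helper names are illustrative).
import re
import pandas as pd

FIBERS = ["cotton", "polyester", "spandex", "elastomultiester", "lyocell", "rayon"]


def to_snake_case(text: str) -> str:
    return re.sub(r"\s+", "_", str(text).strip().lower())


def job_03_clean(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    df.columns = [to_snake_case(c) for c in df.columns]
    for col in ["product_color", "product_fit", "product_name"]:
        df[col] = df[col].apply(to_snake_case)
    # Keep only digits and the decimal point in the price, then cast to float
    df["product_price"] = (
        df["product_price"].astype(str).str.replace(r"[^\d.]", "", regex=True).astype(float)
    )
    # Split the composition string into one percentage column per fiber
    for fiber in FIBERS:
        df[fiber] = (
            df["product_composition"]
            .str.extract(rf"{fiber}\s*(\d+)%", flags=re.IGNORECASE)[0]
            .fillna(0)
            .astype(int)
        )
    return df.drop(columns=["product_composition"]).drop_duplicates()
```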

5. Loading

After the data is cleaned, the script inserts it into a PostgreSQL Database using Python's SQLAlchemy library. For this project, a free Database from Neon.tech is being used. This whole process is Job 04.

It's important to note that new data is always appended to the database, never replaced, making it possible to spot differences in prices for the same product over time.
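A minimal sketch of Job 04, assuming the Neon.tech connection string is kept in an environment variable; the variable and table names are illustrative:

```python
# A minimal sketch of Job 04, assuming SQLAlchemy and a Neon.tech connection string
# in an environment variable; the variable and table names are illustrative.
import os
import pandas as pd
from sqlalchemy import create_engine


def job_04_load(clean: pd.DataFrame, table_name: str = "hm_male_jeans") -> None:
    """Append the cleaned data to the PostgreSQL database."""
    engine = create_engine(os.environ["NEON_POSTGRES_URL"])  # e.g. postgresql+psycopg2://user:pass@host/db
    # if_exists="append" keeps the historical rows, so price changes stay visible over time
    clean.to_sql(table_name, con=engine, if_exists="append", index=False)
```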

6. Streamlit App

Streamlit was chosen to display the data since it makes it easy to create interactive tools, such as filters. In addition, deployment can be done directly through Streamlit Cloud itself, not requiring another cloud provider. Once the data is added to the PostgreSQL database, the Streamlit App has two jobs, sketched below:

  • Job 05: Loading the data from the PostgreSQL Database into Streamlit.

  • Job 06: Displaying the data in a table and adding interactive filters to it.
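A minimal sketch of both jobs, reusing the illustrative connection string and table name from Section 5:

```python
# A minimal sketch of Jobs 05 and 06, reusing the illustrative connection string
# and table name from Section 5.
import os
import pandas as pd
import streamlit as st
from sqlalchemy import create_engine


@st.cache_data(ttl=60 * 60 * 24)  # re-read the database at most once a day
def load_data() -> pd.DataFrame:
    """Job 05: load the table from PostgreSQL into Streamlit."""
    engine = create_engine(os.environ["NEON_POSTGRES_URL"])
    return pd.read_sql("SELECT * FROM hm_male_jeans", con=engine)


# Job 06: display the data with interactive filters
df = load_data()
st.title("Star Jeans - H&M Male Jeans")
fits = st.sidebar.multiselect("Product fit", sorted(df["product_fit"].unique()))
colors = st.sidebar.multiselect("Product color", sorted(df["product_color"].unique()))
if fits:
    df = df[df["product_fit"].isin(fits)]
if colors:
    df = df[df["product_color"].isin(colors)]
st.dataframe(df)
```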

Click below to access the App
Streamlit App

7. Conclusion

In this project the main objective was accomplished:

We managed to create an ETL process that extracts data from H&M, a Star Jeans competitor, cleans it, and saves it to a PostgreSQL database on a weekly basis. The database is then loaded and displayed with filters in a Streamlit App, which can be accessed from anywhere by Star Jeans' owners, so they can have a better understanding of how the USA male jeans market works.

8. Next Steps

Going forward, this solution could be improved by using Apache Airflow instead of Windows Task Scheduler to automate the ETL process, since with Windows Task Scheduler the computer must be on for the script to run.

Contact