# ETL Pipeline for Sparkify

This is an ETL process built for the Sparkify project to analyze the data the company has been collecting on songs and user activity on their new music streaming app.

The fact table `songplays` and the dimension tables `users`, `songs`, `artists`, and `time` are defined in `sql_queries.py`.
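As a rough illustration of the star schema, here is a sketch of how the `songplays` fact table definition might look inside `sql_queries.py`. The column names and types are assumptions based on the description in this README, not the repository's actual code:

```python
# Sketch of a CREATE TABLE statement as it might appear in sql_queries.py.
# Column names/types are illustrative assumptions, not the repo's actual code.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```

`song_id` and `artist_id` act as foreign keys into the `songs` and `artists` dimension tables, which is what makes the lookup described below possible.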
The notebook `etl.ipynb` implements the ETL process for a single JSON file from each of the song and log data folders, while `etl.py` processes all of the files in both folders.
The fact table `songplays` combines song and artist information from the song and log files: the `song_id` and `artist_id` values loaded into the `songs` and `artists` tables from the song dataset are used to populate the fact table, matched against the song title, artist name, and duration found in each log record.
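The lookup described above can be sketched as a query that joins `songs` and `artists` on title, name, and duration. This is a minimal sketch assuming `psycopg2`-style placeholders; the actual query and helper in the repository may differ:

```python
# Sketch of the song/artist lookup: given a log record's song title,
# artist name, and song duration, find the matching song_id and artist_id.
# Table and column names are assumptions based on this README's description.
song_select = """
SELECT s.song_id, a.artist_id
FROM songs s
JOIN artists a ON s.artist_id = a.artist_id
WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""

def lookup_song(cur, title, artist, length):
    """Return (song_id, artist_id) for a log row, or (None, None) if no match."""
    cur.execute(song_select, (title, artist, length))
    result = cur.fetchone()
    return result if result else (None, None)
```

Because the log files carry no song or artist identifiers of their own, any log row with no match in the song dataset is inserted with `NULL` for `song_id` and `artist_id`.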
Perform the following steps to load the data into the tables:

  1. Run `python create_tables.py`. This drops any existing tables and creates the fact and dimension tables.
  2. Run `etl.ipynb` to process a single song JSON file and a single log JSON file, inserting records into all fact and dimension tables. Run `test.ipynb` to check that the data was inserted into the tables.
  3. Run `python etl.py` to process all of the JSON files in the song and log data folders and insert records into the fact and dimension tables. Run `test.ipynb` again to check that all of the data was inserted.
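To give a sense of what `etl.py` does in step 3, here is a sketch of how it might gather every JSON file under a data folder before processing each one. The helper name `get_files` is an assumption and may not match the repository's code:

```python
import glob
import os

def get_files(filepath):
    """Collect the absolute paths of all JSON files under a directory,
    walking nested subfolders the way etl.py traverses the song and
    log data folders. (Illustrative sketch; names are assumptions.)"""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files
```

Each collected file would then be parsed (e.g. with pandas) and its records inserted via the queries defined in `sql_queries.py`.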

The final table `songplays` has the songs and user activity ready for the Sparkify analytics team, displayed below:

Screenshot
