# ETL Pipeline for Sparkify

This is an ETL process built for the Sparkify project to analyze the data the company has been collecting on songs and user activity on their new music streaming app.

The fact table `songplays` and the dimension tables `users`, `songs`, `artists`, and `time` are defined in `sql_queries.py`.
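As a rough illustration of the star schema, here is a sketch of how the `songplays` fact table definition might look inside `sql_queries.py`. The column names and types are assumptions based on the description in this README, not the repository's actual code:

```python
# Sketch of a CREATE TABLE statement as it might appear in sql_queries.py.
# Column names/types are illustrative assumptions, not the repo's actual code.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```

`song_id` and `artist_id` act as foreign keys into the `songs` and `artists` dimension tables, which is what makes the lookup described below possible.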
The notebook `etl.ipynb` implements the ETL process for a single JSON file from each of the song and log data folders, while `etl.py` processes all of the files in both folders.
The fact table `songplays` combines song and artist information from the song and log files: the `song_id` and `artist_id` values loaded into the `songs` and `artists` tables from the song dataset are used to populate the fact table, matched against the song title, artist name, and duration found in each log record.
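The lookup described above can be sketched as a query that joins `songs` and `artists` on title, name, and duration. This is a minimal sketch assuming `psycopg2`-style placeholders; the actual query and helper in the repository may differ:

```python
# Sketch of the song/artist lookup: given a log record's song title,
# artist name, and song duration, find the matching song_id and artist_id.
# Table and column names are assumptions based on this README's description.
song_select = """
SELECT s.song_id, a.artist_id
FROM songs s
JOIN artists a ON s.artist_id = a.artist_id
WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""

def lookup_song(cur, title, artist, length):
    """Return (song_id, artist_id) for a log row, or (None, None) if no match."""
    cur.execute(song_select, (title, artist, length))
    result = cur.fetchone()
    return result if result else (None, None)
```

Because the log files carry no song or artist identifiers of their own, any log row with no match in the song dataset is inserted with `NULL` for `song_id` and `artist_id`.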
Perform the following steps to load the data into the tables:

  1. Run `python create_tables.py`. This drops any existing tables and creates the fact and dimension tables.
  2. Run `etl.ipynb` to process a single song JSON file and a single log JSON file, inserting records into all fact and dimension tables. Run `test.ipynb` to check that the data was inserted into the tables.
  3. Run `python etl.py` to process all of the JSON files in the song and log data folders and insert records into the fact and dimension tables. Run `test.ipynb` again to check that all of the data was inserted.
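To give a sense of what `etl.py` does in step 3, here is a sketch of how it might gather every JSON file under a data folder before processing each one. The helper name `get_files` is an assumption and may not match the repository's code:

```python
import glob
import os

def get_files(filepath):
    """Collect the absolute paths of all JSON files under a directory,
    walking nested subfolders the way etl.py traverses the song and
    log data folders. (Illustrative sketch; names are assumptions.)"""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        for f in glob.glob(os.path.join(root, "*.json")):
            all_files.append(os.path.abspath(f))
    return all_files
```

Each collected file would then be parsed (e.g. with pandas) and its records inserted via the queries defined in `sql_queries.py`.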

The final table `songplays` has the songs and user activity ready for the Sparkify analytics team, displayed below:

Screenshot
