Sparkify services music streaming app. Sparkify team is interested in which songs their users listening to, with their streaming app.
- Goal of analysis is to understand what songs users are listening to
- Currently, usage log data and song data are stored in a JSON format, which is hard to query on.
- Data engineers are to design and build Redshift DB using JSON data, so that Sparkify team can easily query their data.
Since this database is for analytical purpose, I chose star schema for analytical query performance. It is not normalized to 3NF form, so this schema may lacks in terms of handling redundancy and consistency.
- Fact tables
- song plays log table
- Dimension tables
- songs table
- artist table
- user table
- time table
- Songs and log JSON files (in S3 bucket) are loaded as Spark dataframe.
- Using the Spark dataframes, we create output dataframes that correspond to fact & dimension tables.
- Write output dataframes to S3 bucket as parquet file format
- Extracting songs and artist data
- Data for songs and artists relations are extracted from song_data (JSON format)
- From song_data,
- song_id, title, artist_id, year, duration are loaded onto songs relation
- artist_id, artist_name, artist_location, artist_latitude, artist_longitude are loaded onto artists relation
- Extracting user, time, log data
- Data for user, time and song plays relations are extracted from log_data(JSON format)
- From log_data,
- ts (timestamp) data is loaded onto time relation. Hour, day, week of year, month and weekday values are extracted from ts before loaded.
- user_id, first_name, last_name, gender, level are loaded onto users relation
- songplays table, which is a fact table, has ts, user_id, level, song_id, artist_id, session_id, location, user_agent as its columns
- In the EMR SSH, execute following command
spark-submit --master yarn etl.py
- Query below fetches the most played song in songplays table
SELECT songplays.song_id, title, COUNT(*) FROM songplays
JOIN songs ON songplays.song_id = songs.song_id
GROUP BY songplays.song_id, title
ORDER BY COUNT(*) DESC
LIMIT 1
song_id | title | count |
---|---|---|
SOBONKR12A58A7A7E0 | You're The One | 37 |