This is a mini-project for SC1015 (Introduction to Data Science and Artificial Intelligence) that focuses on data from https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv
Check out our Project Video and Slide Deck too
Independent artists struggle to predict Spotify song popularity, leading to various drawbacks:
- As an independent artist, it is difficult to gauge the potential popularity of a new song before releasing it on Spotify
- Without data on how previous songs have performed, it's hard to make informed decisions about what to release next
- Even with data, it is challenging to interpret it, especially for artists without a background in data science or analytics.
Against this backdrop, about 70% of independent artists earn less than $10,000 a year from their music, even though independents account for 41.4% of the music industry.
Problem Statement: Develop a model that predicts the popularity of a song on Spotify before its release.
This will allow independent artists to:
- Make more informed decisions about which songs to release and how to allocate their time and resources.
- Better negotiate deals with record labels, potentially leading to more opportunities and revenue.
- Tailor their marketing strategies and promotional efforts to maximize the song's impact upon release.
- Identify strengths and weaknesses, facilitating artistic growth and improvement in future releases.
Raw Dataset used: https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv
The DataPrepandCleaning notebook shows the data cleaning and preparation process.
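As a rough illustration of the kind of preparation involved, the sketch below derives the release year, release month, and number of artists from a raw row. It is not taken from the notebook. It assumes that `release_date` is a string with varying precision (for example `1922` or `1922-02-22`) and that `artists` is a stringified Python list; the helper names are hypothetical.

```python
import ast

def parse_release_date(release_date):
    """Return (year, month); month is None when only the year is given."""
    parts = release_date.split("-")
    year = int(parts[0])
    month = int(parts[1]) if len(parts) > 1 else None
    return year, month

def count_artists(artists_field):
    """Count the artists in a stringified list such as "['A', 'B']"."""
    return len(ast.literal_eval(artists_field))

def derive_features(row):
    """Return a copy of the row with the derived columns added."""
    year, month = parse_release_date(row["release_date"])
    return {
        **row,
        "release_year": year,
        "release_month": month,
        "num_artists": count_artists(row["artists"]),
    }

# Made-up row for illustration only (assumed column formats).
example = {"release_date": "1999-07-15", "artists": "['Artist A', 'Artist B']"}
print(derive_features(example))
```

Rows that carry only a year end up with no release month, so they would need to be dropped or handled separately before any month-level analysis.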
The EDA notebook shows the data analysis process, in which each attribute and its relationship with popularity were explored. The analysis suggests the following when evaluating popularity:
Non-negligible Factors:
- Duration
- Explicit
- Release Year
- Release Month
- Danceability
- Energy
- Loudness
- Acousticness
- Instrumentalness
Negligible Factors:
- Number of Artists
- Key
- Mode
- Speechiness
- Liveness
- Valence
- Tempo
- Time_Signature
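One common way to gauge how strongly a numeric attribute relates to popularity is the Pearson correlation coefficient. The sketch below is only an illustration of that idea and is not necessarily how the EDA notebook reached the grouping above. The values are made up, and the `pearson` helper is hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up toy values, NOT taken from the dataset.
tracks = {
    "loudness": [-20.0, -15.0, -10.0, -8.0, -5.0],
    "key": [0, 7, 2, 9, 4],
    "popularity": [10, 25, 40, 50, 65],
}

target = tracks["popularity"]
correlations = {
    name: pearson(values, target)
    for name, values in tracks.items()
    if name != "popularity"
}

# Rank attributes by the absolute strength of their linear relationship.
for name, r in sorted(correlations.items(), key=lambda item: -abs(item[1])):
    print(f"{name}: r = {r:.2f}")
```

Pearson correlation only captures linear relationships, so categorical attributes such as key or mode are usually better inspected with per-category plots or summaries.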
The ML Models notebook shows how various ML models were used to predict popularity and how these models were evaluated:
- Linear Regression
- Random Forest
- Decision Tree
- KNN (found in the new technique folder)
The KNN algorithm gave us the best model for predicting popularity.
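To make the idea concrete, here is a minimal, dependency-free sketch of KNN regression with cross-validation used to compare values of k. It is not the code from our notebooks, which may rely on a library implementation. The data is synthetic, and the function names and the interleaved fold split are assumptions made for illustration.

```python
from math import sqrt

def knn_predict(train_X, train_y, query, k):
    """Predict by averaging the targets of the k nearest training points."""
    by_distance = sorted(
        (sqrt(sum((a - b) ** 2 for a, b in zip(x, query))), y)
        for x, y in zip(train_X, train_y)
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cross_val_rmse(X, y, k, folds=5):
    """RMSE over all held-out predictions, using an interleaved fold split."""
    n = len(X)
    squared_errors = []
    for fold in range(folds):
        # Hold out every `folds`-th sample, starting at index `fold`.
        test_idx = set(range(fold, n, folds))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            error = knn_predict(train_X, train_y, X[i], k) - y[i]
            squared_errors.append(error ** 2)
    return sqrt(sum(squared_errors) / n)

# Synthetic stand-in data: (danceability, energy) pairs on a grid with a
# made-up popularity score. Real tracks would come from the cleaned dataset.
X = [(d / 9, e / 9) for d in range(10) for e in range(10)]
y = [50 * (d + e) for d, e in X]

scores = {k: cross_val_rmse(X, y, k) for k in (1, 3, 5, 9)}
best_k = min(scores, key=scores.get)
for k, rmse in scores.items():
    print(f"k={k}: RMSE={rmse:.2f}")
print("best k:", best_k)
```

Because KNN relies on distances, attributes on very different scales (for example loudness in dB versus danceability in [0, 1]) would normally be standardized before fitting.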
Future Implementations:
- Web scraping and removing top artists from the dataset to further target independent artists
- Going a step further by classifying songs by genre first and then evaluating each genre separately
Insights:
- Working with and cleaning a raw dataset to better address our problem statement
- Using techniques such as cross-validation to improve models
- Linking the model's results back to the EDA to understand the workings of the model
- Using clustering to analyze data
- Modeling using KNN and Random Forest for prediction
- Cross-validation to reduce overfitting and error
- Feature selection on models to determine importance
Dixit Ayushman - EDA, KNN algorithm, final insights
Summit Bajaj - Problem Formulation, Data Prep and Cleaning, 3 ML models
https://www.kaggle.com/datasets/yamaerenay/spotify-dataset-19212020-600k-tracks?select=tracks.csv
https://www.hypebot.com/hypebot/2020/12/stats-facts-data-independent-artists-need-to-know.html