Data analysis project that studies the Spambase dataset (a collection of information about spam and non-spam emails) statistically, with the goal of finding the binary classification algorithm that performs best on it.
The project consists of the following phases:
- Data Acquisition.
- Data Cleaning and Organization, including analysis of outliers and inconsistencies and dataset balancing.
- Statistical exploration using the Probability Mass Function, Cumulative Distribution Function, Gaussian fitting, variability analysis (IQR) and correlation between variables.
- Classification with different algorithms and performance evaluation (Logistic Regression with Backward Elimination, Decision Trees and Forests, K-NN).
- Clustering: K-Means.
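
As an illustration of the statistical exploration phase, the sketch below flags outliers with the IQR rule and evaluates an empirical CDF. It is not taken from the project code: the function names `iqr_bounds` and `ecdf`, the fence multiplier `k=1.5` and the toy values are assumptions chosen for the example, and only the Python standard library is used.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def ecdf(values, x):
    """Empirical CDF: fraction of observations less than or equal to x."""
    return sum(v <= x for v in values) / len(values)

# Toy word-frequency column (made-up numbers, not taken from the dataset).
freq = [0.0, 0.0, 0.1, 0.2, 0.2, 0.3, 0.4, 5.0]

low, high = iqr_bounds(freq)
outliers = [v for v in freq if v < low or v > high]

print("fences:", low, high)
print("outliers:", outliers)
print("ECDF at 0.2:", ecdf(freq, 0.2))
```

On these toy values only `5.0` falls outside the fences. Whether such points should be dropped or kept is a separate decision: a single very high word frequency may be exactly what distinguishes a spam email.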
Dataset information:
- Number of instances: 4601, of which 1813 are spam (39.4%).
- Number of attributes: 58 (57 continuous, 1 categorical: the class label).
The last column of the data matrix holds the class label (1: spam, 0: non-spam). Many of the attributes indicate how frequently a particular word or character appears in the email text.
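
The layout described above can be sketched as follows. This is a minimal example that assumes the data is stored as comma-separated rows without a header; the three rows in `sample` are made up and shortened to three features plus the label, so they are not real records from the dataset.

```python
import csv
import io

# Made-up rows in the assumed layout: features first, class label last.
sample = "0.0,0.64,0.0,1\n0.21,0.28,0.5,1\n0.0,0.0,0.0,0\n"

features, labels = [], []
for row in csv.reader(io.StringIO(sample)):
    *values, label = row              # last column is the class label
    features.append([float(v) for v in values])
    labels.append(int(label))         # 1: spam, 0: non-spam

# Share of spam in the toy sample.
spam_share = sum(labels) / len(labels)

# Sanity check of the class balance stated above: 1813 spam out of 4601.
stated_share = round(100 * 1813 / 4601, 1)

print("labels:", labels)
print("stated spam share (%):", stated_share)
```

With the real file, `io.StringIO(sample)` would be replaced by an opened file handle, and each row would carry 57 feature values before the label.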
The dataset can be downloaded from the following URL: