
Data Analysis project based on the classification between Spam and No-Spam emails.

AdrianaMacc/DataAnalysis


Spam-base (Spam/noSpam) Classification

A data analysis project that studies the Spam-base dataset (a collection of features describing spam and non-spam emails) statistically, with the goal of finding the binary classification algorithm that performs best on it.

The project consists of the following phases:

  • Data Acquisition.
  • Data Cleaning and Organization, including analysis of outliers and inconsistencies and dataset balancing.
  • Statistical exploration using the Probability Mass Function, the Cumulative Distribution Function, Gaussian Fitting, Variability Analysis (IQR) and correlation between variables.
  • Classification with several algorithms and performance evaluation (Logistic Regression with Backward Elimination, Decision Trees and Forests, K-NN).
  • Clustering: K-Means.
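
As an illustration of the variability analysis (IQR) step listed above, the following is a minimal sketch of how outliers in a single attribute could be flagged with interquartile-range fences. It is not taken from the project's code: the function names `iqr_bounds` and `flag_outliers`, the multiplier `k=1.5`, and the choice of the "inclusive" quantile method are assumptions made for the example.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR.

    k=1.5 is the conventional multiplier; it is an assumption here,
    not a value stated by the project.
    """
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    # Hypothetical attribute values, not real Spam-base data.
    column = [1, 2, 3, 4, 5, 6, 7, 8, 100]
    print(iqr_bounds(column))
    print(flag_outliers(column))
```

Whether flagged values should be dropped, capped, or kept is a separate decision; word-frequency attributes tend to be heavily skewed, so a strict fence may flag many legitimate rows.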

Dataset information:

  • Number of instances: 4601, of which 1813 are SPAM (39.4%)
  • Number of attributes: 58 (57 continuous, 1 categorical class label).

The last column of the data matrix holds the class label (1: spam, 0: no-spam). Many of the attributes indicate how frequently a particular word or character appears in the email text.
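
The layout described above (features first, class label in the last column) can be sketched as follows. This is an illustrative example rather than the project's own loading code: it assumes comma-separated rows, the helper names `split_features_and_labels` and `spam_share` are hypothetical, and the three sample rows are made up with only two feature columns instead of 57.

```python
import csv
import io


def split_features_and_labels(rows):
    """Split each row into its feature values and its class label.

    Assumes the label is the last field of every row (1: spam, 0: no-spam).
    """
    features, labels = [], []
    for row in rows:
        *values, label = row
        features.append([float(v) for v in values])
        labels.append(int(float(label)))
    return features, labels


def spam_share(labels):
    """Fraction of rows labelled as spam."""
    return sum(labels) / len(labels)


# Made-up sample standing in for the real data file.
sample = io.StringIO("0.0,0.64,1\n0.21,0.28,0\n0.06,0.0,1\n")
X, y = split_features_and_labels(csv.reader(sample))

if __name__ == "__main__":
    print(X, y, spam_share(y))
```

Applied to the full dataset, `spam_share` should agree with the figure quoted above, since 1813 / 4601 is approximately 0.394.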

The dataset can be downloaded from the following URL:

https://archive-beta.ics.uci.edu/dataset/94/spambase