Spam-base (Spam/noSpam) Classification

A data analysis project that statistically studies the Spam-base dataset (a collection of features describing spam and non-spam emails) in order to find the binary classification algorithm with the best performance.

The project consists of the following phases:

Data Acquisition.
Data Cleaning and Organization, with outlier and inconsistency analysis and dataset balancing.
Statistical exploration using the Probability Mass Function, Cumulative Distribution Function, Gaussian fitting, variability analysis (IQR) and variable correlation.
Classification with different algorithms and performance evaluation (Logistic Regression with Backward Elimination, Decision Trees and Forests, K-NN).
Clustering: K-Means.
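
As an illustration of the outlier and variability (IQR) steps listed above, here is a minimal sketch of the common Tukey rule, which flags values lying more than 1.5 IQR outside the quartiles. The function name `iqr_outliers`, the factor `k=1.5` and the sample values are assumptions made for this example; the notebook may use a different rule or library.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower, upper, outliers) using the Tukey IQR rule.

    Hypothetical helper for illustration; k=1.5 is the conventional
    default, not a value taken from the project.
    """
    # Quartiles Q1 and Q3 (the middle value is the median, unused here).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Made-up word-frequency values, NOT taken from the dataset.
sample = [0.0, 0.0, 0.1, 0.2, 0.2, 0.3, 0.4, 9.5]
lower, upper, outliers = iqr_outliers(sample)
print(outliers)
```

Whether a flagged value should be dropped, capped or kept is a separate decision; frequency features tend to be heavily skewed, so a strict cutoff can discard many legitimate rows.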

Dataset information:

Number of instances: 4601, of which 1813 are spam (39.4%)
Number of attributes: 58 (57 continuous, 1 categorical class label)
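
The class counts above imply 2788 non-spam emails, so the classes are moderately imbalanced. The sketch below checks that arithmetic and shows one possible way to balance the classes: random undersampling of the majority class. The project mentions dataset balancing but does not say which method it uses, so `undersample` and its `seed` parameter are assumptions for illustration only.

```python
import random

N_TOTAL = 4601            # instances, as stated above
N_SPAM = 1813             # spam instances, as stated above
N_HAM = N_TOTAL - N_SPAM  # non-spam instances

spam_share = N_SPAM / N_TOTAL  # fraction of spam in the dataset


def undersample(labels, seed=0):
    """Return sorted row indices with equally many rows per class.

    Hypothetical helper: randomly drops rows of the larger class.
    """
    rng = random.Random(seed)
    spam_idx = [i for i, y in enumerate(labels) if y == 1]
    ham_idx = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(spam_idx), len(ham_idx))
    return sorted(rng.sample(spam_idx, n) + rng.sample(ham_idx, n))


# Placeholder labels that only reproduce the stated class counts.
labels = [1] * N_SPAM + [0] * N_HAM
balanced = undersample(labels)
print(N_HAM, round(100 * spam_share, 1), len(balanced))
```

Undersampling discards data; oversampling the minority class or using class weights in the classifier are alternatives that keep every row.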

The last column of the data matrix is the class label (1: spam, 0: no-spam). Many of the attributes indicate how frequently a particular word or character appears in the email text.
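
The layout described above (57 feature columns followed by the label) can be split as in the following sketch. It assumes a comma-separated file without a header row; the function name `load_rows` and the two example rows are made up for illustration and are not taken from the dataset or the notebook.

```python
import csv
import io

N_FEATURES = 57  # continuous attributes, as stated above


def load_rows(handle):
    """Split each row into 57 float features and an integer label.

    Hypothetical helper; assumes comma-separated values, no header.
    """
    features, labels = [], []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        values = [float(x) for x in row]
        if len(values) != N_FEATURES + 1:
            raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(values)}")
        features.append(values[:-1])     # everything except the last column
        labels.append(int(values[-1]))   # last column: 1 = spam, 0 = no-spam
    return features, labels


# Two synthetic rows standing in for the real file.
fake = ",".join(["0.5"] * 57 + ["1"]) + "\n" + ",".join(["0"] * 57 + ["0"]) + "\n"
X, y = load_rows(io.StringIO(fake))
print(len(X), len(X[0]), y)
```

With the downloaded file, the same function could be called on an open file handle in place of the `io.StringIO` object.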

The dataset can be downloaded from the following URL:

https://archive-beta.ics.uci.edu/dataset/94/spambase
