SMS Spam

Classifying SMS text messages into "spam" or "ham" (non-spam)

The SMS Spam Collection dataset on Kaggle contains several thousand SMS text messages, each labeled either "spam" or "ham" (that is, not spam). Since the labels are given, I am treating this as a supervised classification problem. Because the original dataset is only ~13% spam, I used over- and undersampling to produce a nearly perfectly balanced dataset, sms-balanced.csv, which is what learn.R reads when it runs.
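To illustrate the general idea of combining over- and undersampling, here is a minimal Python sketch. It is not the procedure used to build sms-balanced.csv; the function name `balance`, the `per_class` target, and the toy messages are all assumptions made for illustration. The majority class is sampled down without replacement and the minority class is padded with random duplicates until both reach the same size.

```python
import random
from collections import Counter


def balance(messages, per_class, seed=0):
    """Return a shuffled list with exactly `per_class` rows for each label.

    `messages` is a list of (label, text) tuples. This is a hypothetical
    helper for illustration, not the code behind sms-balanced.csv.
    """
    rng = random.Random(seed)

    # Group rows by their label.
    by_label = {}
    for label, text in messages:
        by_label.setdefault(label, []).append((label, text))

    balanced = []
    for label, rows in by_label.items():
        if len(rows) >= per_class:
            # Undersample: draw per_class rows without replacement.
            balanced.extend(rng.sample(rows, per_class))
        else:
            # Oversample: keep every original row, then add random
            # duplicates (drawn with replacement) to reach per_class.
            balanced.extend(rows)
            balanced.extend(rng.choices(rows, k=per_class - len(rows)))

    rng.shuffle(balanced)
    return balanced


# Toy data with roughly the same imbalance as the original (13% spam).
toy = [("spam", f"spam message {i}") for i in range(13)]
toy += [("ham", f"ham message {i}") for i in range(87)]

balanced = balance(toy, per_class=50)
counts = Counter(label for label, _ in balanced)
print(counts)
```

Choosing a target between the two class sizes means neither class is discarded wholesale nor duplicated many times over.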

This project was my first use of tidymodels and its associated packages in R. I feel they are almost, but not quite, what R really needs. I ran into a number of frustrations with them, which culminated in my abandoning any attempt to tune the final model evaluated on the test set; fortunately, tuning turned out to be unnecessary. There is a good chance that any further text classification I attempt will use the Python library spaCy.

That said, this project is currently the best result I have achieved among the machine learning efforts I have attempted. On the test set, accuracy, sensitivity, specificity and AUC-ROC are all above 0.99:

[1] "Evaluating performance on test set:"
# A tibble: 4 × 4
  .metric  .estimator .estimate .config
  <chr>    <chr>          <dbl> <chr>
1 accuracy binary         0.991 Preprocessor1_Model1
2 sens     binary         0.991 Preprocessor1_Model1
3 spec     binary         0.991 Preprocessor1_Model1
4 roc_auc  binary         0.998 Preprocessor1_Model1
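For reference, the first three metrics in the table can all be derived from the four cells of a binary confusion matrix. The Python sketch below shows the arithmetic; the function name `binary_metrics` and the counts passed to it are made up for illustration and are not this project's results. ROC AUC is left out because it depends on the predicted probabilities across all thresholds, not on a single confusion matrix.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn are positive-class rows predicted correctly/incorrectly;
    tn/fp are negative-class rows predicted correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,  # share of all rows predicted correctly
        "sens": tp / (tp + fn),         # share of actual positives caught
        "spec": tn / (tn + fp),         # share of actual negatives caught
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=90, fn=10, tn=95, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

On a balanced test set, accuracy is simply the average of sensitivity and specificity, which is why the three values sit so close together in the table above.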
