The classic data science intro with Kaggle's Titanic data set is used to illustrate a simple but complete data science life cycle, including EDA, cleaning, a train-test split, and model optimization.

dullibri/titanic

Kaggle Titanic Competition

This notebook predicts the survival of passengers of the Titanic for Kaggle. It walks through a compact but complete data science life cycle, from data cleaning and feature engineering to model selection and prediction. It uses a train-validation-test split and is built on pipelines to prevent data leakage. The Kaggle score so far is 78.5 percent.
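The split-plus-pipeline setup can be sketched roughly as follows. The toy DataFrame and the specific pipeline steps here are illustrative assumptions, not the notebook's actual code; the point is that imputation and scaling statistics are fitted only on the training fold, so nothing leaks from validation or test data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Hypothetical toy frame standing in for the Kaggle training CSV.
df = pd.DataFrame({
    "Age": [22, 38, 26, 35, None, 54, 2, 27, 14, 58],
    "Fare": [7.25, 71.28, 7.92, 53.1, 8.05, 51.86, 21.07, 11.13, 30.07, 26.55],
    "Survived": [0, 1, 1, 1, 0, 1, 0, 1, 0, 1],
})
X, y = df[["Age", "Fare"]], df["Survived"]

# Two consecutive splits yield train / validation / test folds.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# The pipeline is fitted on the training fold only, so the imputer's
# median and the scaler's statistics never see validation/test rows.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
val_accuracy = pipe.score(X_val, y_val)
```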

Data Cleaning

Only three variables contain missing values. Cleaning uses standard median imputation for continuous variables and most-frequent imputation for categorical ones. A regression-based approach to imputing age is explored as well.

Feature engineering

In particular, it tackles

  • the names and titles of passengers and
  • the number of relatives.

It also starts exploring cabin and fare.
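The two engineered features above can be sketched in pandas. The sample rows and column names mirror the standard Kaggle Titanic schema (`Name`, `SibSp`, `Parch`); the exact regex and feature names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows mimicking the Titanic Name / SibSp / Parch columns.
df = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
        "Heikkinen, Miss. Laina",
    ],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
})

# The title sits between the comma and the period in the Name field.
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.")

# Number of relatives on board: siblings/spouses plus parents/children.
df["Relatives"] = df["SibSp"] + df["Parch"]
```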

Model Selection

First, logistic regression is explored and its hyperparameters are tuned. The same is then done for a Random Forest and finally for XGBoost.
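One common way to run this kind of model comparison is a small cross-validated grid search per candidate, as sketched below on synthetic data. The grids and the synthetic feature matrix are illustrative assumptions; XGBoost would slot into the same loop with its own parameter grid.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the engineered Titanic feature matrix.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Tune each candidate over a small grid with 5-fold cross-validation,
# then compare the best cross-validated accuracies.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    "forest": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
}
best = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5).fit(X, y)
    best[name] = search.best_score_
```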
