roadmap

Matt Tranquada edited this page Feb 2, 2018 · 1 revision

Data Analysis Roadmap Draft

This outline is based on Carleton's articulation of the data analysis workflow. Links to other resources in the repo will be added as they are developed.

Domain Understanding

  • Defining the business problem
  • Defining the target variable

Data Acquisition

  • SQL
  • Webscraping/APIs
  • Flat files
  • Survey
  • Website/metadata
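
As a rough illustration of combining two of the sources above, the sketch below reads a flat file and a SQL table and joins them on a shared key. The table name, column names, and data are all hypothetical stand-ins; an in-memory SQLite database and an in-memory CSV string take the place of a real database and a real file.

```python
import csv
import io
import sqlite3

# Hypothetical flat file; in practice this would be a file opened from disk.
flat_file = io.StringIO("customer_id,plan\n1,basic\n2,premium\n3,basic\n")
plans = {int(row["customer_id"]): row["plan"] for row in csv.DictReader(flat_file)}

# Hypothetical SQL source: an in-memory SQLite table standing in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_minutes (customer_id INTEGER, minutes REAL)")
conn.executemany(
    "INSERT INTO usage_minutes VALUES (?, ?)",
    [(1, 120.0), (2, 340.5), (3, 15.25)],
)
rows = conn.execute(
    "SELECT customer_id, minutes FROM usage_minutes ORDER BY customer_id"
).fetchall()
conn.close()

# Join the two sources on customer_id into one list of records.
records = [
    {"customer_id": cid, "minutes": minutes, "plan": plans.get(cid)}
    for cid, minutes in rows
]
```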

Quick EDA + Train/Test Split

  • Identifying weird distributions
  • Assessing target variable appropriateness
  • Assessing balance of classes in target variable
  • Identifying problems from data acquisition stage
  • Understanding scale of features
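
Checking class balance and then splitting the data could look something like the following dependency-free sketch. The function names and toy labels are assumptions made for illustration; in practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument would usually do this job.

```python
import random
from collections import Counter, defaultdict

def class_balance(labels):
    """Return the share of each class in the target variable."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split row indices into train/test sets, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one example of every class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical imbalanced target: 16 negatives, 4 positives.
labels = [0] * 16 + [1] * 4
balance = class_balance(labels)
train_idx, test_idx = stratified_split(labels)
```

Splitting before any deeper exploration keeps the test rows out of later decisions about features and models.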

Data Preparation

  • Checking data types
  • Checking for NaN/null values
  • Checking for values that should be NaN/null
  • Removing "noise" observations
  • Separating numerical and categorical features
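
The checks above can be sketched in plain Python as follows. The sentinel codes, column names, and records are hypothetical; which values "should be NaN/null" depends entirely on how the data was collected.

```python
import math

# Hypothetical sentinel codes that should be treated as missing.
MISSING_SENTINELS = {"", "NA", "N/A", "null", "-999"}

def clean_value(raw):
    """Convert a raw value to float, str, or None (for missing)."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text in MISSING_SENTINELS:
        return None
    try:
        number = float(text)
    except ValueError:
        return text  # not numeric, keep as a categorical string
    return None if math.isnan(number) else number

def profile_columns(records):
    """Count missing values per column and split columns by inferred type."""
    columns = list(records[0].keys())
    cleaned = [{col: clean_value(rec.get(col)) for col in columns} for rec in records]
    missing = {col: sum(row[col] is None for row in cleaned) for col in columns}
    numeric, categorical = [], []
    for col in columns:
        present = [row[col] for row in cleaned if row[col] is not None]
        if present and all(isinstance(v, float) for v in present):
            numeric.append(col)
        else:
            categorical.append(col)
    return cleaned, missing, numeric, categorical

raw_records = [
    {"age": "34", "income": "52000", "city": "Austin"},
    {"age": "-999", "income": "NaN", "city": "Boston"},
    {"age": "41", "income": "61000", "city": "N/A"},
]
cleaned, missing, numeric_cols, categorical_cols = profile_columns(raw_records)
```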

Feature Engineering/Extraction

  • Numeric
    • Standardization
    • Imputing missing values
  • Categorical
    • Dealing with unbalanced classes
    • Creating dummy variables
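
A minimal sketch of three of these steps (mean imputation, standardization, and dummy variables) is shown below. The helper names and sample values are illustrative assumptions, and mean imputation is only one of several possible strategies.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Create one dummy column per category, in sorted category order."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

ages = impute_mean([30.0, None, 50.0])  # None -> 40.0
scaled = standardize(ages)
cats, dummies = one_hot(["red", "blue", "red"])
```

When a train/test split is in place, the imputation mean and the scaling statistics would normally be computed on the training rows only and then reused on the test rows.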

EDA Redux/Coda

  • Plots
    • Scatter plots
    • Bar plots
    • Other plots
  • Feature Selection
    • Select k best
    • Select k percentile
    • Recursive feature elimination (RFE)
    • Regularization
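
To make "select k best" concrete, here is a small sketch that scores each feature against the target and keeps the top k. Using the absolute Pearson correlation as the score is an assumption made for this example; other scoring functions are common, and the feature names and values are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def select_k_best(features, target, k):
    """Return the k feature names with the largest |r| against the target."""
    scores = {name: abs(pearson_r(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Hypothetical features and target.
features = {
    "hours_studied": [1.0, 2.0, 3.0, 4.0, 5.0],
    "shoe_size": [9.0, 7.0, 10.0, 8.0, 9.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [52.0, 58.0, 65.0, 71.0, 80.0]
selected, scores = select_k_best(features, target, k=1)
```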

Modeling

  • Prediction
    • Linear regression
    • LASSO/Ridge/ElasticNet regression
    • Decision tree regression (DTR)
    • Neural networks (NN)
  • Classification
    • Decision tree classification (DTC)
    • K-nearest neighbors (KNN)
    • Support vector machines (SVM)
    • Neural networks (NN)
    • Ensemble Methods
      • Bagging
      • Boosting
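
As one example from the classification list, a k-nearest neighbors classifier can be sketched in a few lines. The training points and labels are hypothetical, and this version uses plain Euclidean distance with a simple majority vote.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with two well-separated classes.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_y = ["low", "low", "low", "high", "high", "high"]

near_low = knn_predict(train_X, train_y, (1.2, 1.4))
near_high = knn_predict(train_X, train_y, (6.4, 6.4))
```

Because KNN relies on distances, features on very different scales would usually be standardized first.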

Evaluation

  • Prediction
    • R^2
    • MSE, RMSE, MedAE, MSLE
  • Classification
    • Accuracy
    • Precision/Recall/F1
    • ROC/AUC
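
The metrics listed above can be written out directly, which may help clarify what each one measures. This is a sketch with invented inputs; MSLE here uses log(1 + y) and therefore assumes non-negative values, and ROC/AUC is left out because it needs predicted scores rather than hard labels.

```python
import math
from statistics import median

def regression_metrics(y_true, y_pred):
    """R^2, MSE, RMSE, median absolute error, and mean squared log error."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1 - (mse * n) / ss_tot,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "medae": median(abs(e) for e in errors),
        "msle": sum(
            (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
        ) / n,
    }

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

reg = regression_metrics([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 3.0, 8.0])
clf = classification_metrics([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
```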

Interpretation/Communication

  • Communicating how machine learning metrics translate into business value
    • Improving decision making
    • Saving money/increasing profits
    • Optimizing productivity
    • Saving time
    • Creating goodwill

Deployment

  • Can your model be used for predictive analysis?
    • Pipelines
    • Unit-testing
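
One way to picture the pipeline and unit-testing items together is the toy sketch below: a preprocessing step and a model are bundled so that statistics learned during fitting are reused at prediction time, and a small test checks the behavior on unseen data. All class names and the threshold rule are hypothetical; the test function is written in a style that a runner such as pytest could collect.

```python
from statistics import mean, pstdev

class MeanScaler:
    """Step that learns mean/std on training data and reuses them later."""
    def fit(self, values):
        self.mu = mean(values)
        self.sigma = pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mu) / self.sigma for v in values]

class ThresholdModel:
    """Toy model: predicts 1 when the scaled value is above zero."""
    def fit(self, values):
        return self

    def predict(self, values):
        return [1 if v > 0 else 0 for v in values]

class Pipeline:
    """Chain one scaler and one model behind a single fit/predict interface."""
    def __init__(self, scaler, model):
        self.scaler = scaler
        self.model = model

    def fit(self, values):
        self.model.fit(self.scaler.fit(values).transform(values))
        return self

    def predict(self, values):
        return self.model.predict(self.scaler.transform(values))

def test_pipeline_predicts_on_unseen_data():
    pipe = Pipeline(MeanScaler(), ThresholdModel()).fit([10.0, 20.0, 30.0])
    assert pipe.predict([5.0, 25.0]) == [0, 1]
```

Keeping preprocessing inside the pipeline means new data passes through exactly the same transformations as the training data.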