roadmap
The following outline is based on Carleton's articulation of the data analysis workflow. Links to other resources within the repo will be added as they are developed. Short Python sketches illustrating several of the stages follow the outline.
- Defining the business problem
- Defining the target variable
- SQL
- Web scraping/APIs
- Flat files
- Survey
- Website/metadata
- Identifying weird distributions
- Assessing target variable appropriateness
- Assessing balance of classes in target variable
- Identifying problems from data acquisition stage
- Understanding scale of features
- Checking data types
- Checking for NaN/null values
- Checking for values that should be NaN/null
- Removing "noise" observations
- Separating numerical and categorical features
- Numeric
  - Standardization
  - Imputing missing values
- Categorical
  - Dealing with unbalanced classes
  - Creating dummy variables
- Plots
  - Scatter plots
  - Bar plots
  - Other plots
- Feature Selection
  - Select k best
  - Select k percentile
  - RFE
  - Regularization
- Prediction
  - Linear regression
  - LASSO/Ridge/ElasticNet regression
  - Decision tree regression (DTR)
  - Neural Networks
- Classification
  - Decision tree classifier (DTC)
  - K-nearest neighbors (KNN)
  - Support vector machines (SVM)
  - Neural networks (NN)
- Ensemble Methods
  - Bagging
  - Boosting
- Prediction metrics
  - R^2
  - MSE, RMSE, MedAE, MSLE
- Classification metrics
  - Accuracy
  - Precision/Recall/F1
  - ROC/AUC
- Communicating how machine learning metrics translate into business value
  - Improving decision making
  - Saving money/increasing profits
  - Optimizing productivity
  - Saving time
  - Creating goodwill
- Can your model be used for predictive analysis?
- Pipelines
- Unit-testing
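The sketches below illustrate several stages of the outline. The page itself names no toolkit, so pandas and scikit-learn are assumed throughout, and every file name, table name, endpoint, and column name is a placeholder rather than a real project resource. First, a minimal sketch of the acquisition options (SQL, web APIs, flat files):

```python
import sqlite3

import pandas as pd
import requests

# SQL: pull a query result into a DataFrame (the SQLite file and table are placeholders)
conn = sqlite3.connect("example.db")
sql_df = pd.read_sql("SELECT * FROM observations", conn)

# Web API: request JSON and flatten it into rows (the endpoint is hypothetical)
resp = requests.get("https://api.example.com/records", timeout=30)
api_df = pd.json_normalize(resp.json())

# Flat file: read a CSV export (survey results or site metadata often arrive this way)
flat_df = pd.read_csv("data.csv")
```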
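A sketch of the exploratory checks: spotting weird distributions, features on very different scales, and the balance of classes in the target variable. The CSV path and the `target` column are placeholders.

```python
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path

# Summary statistics flag odd distributions and features on very different scales
print(df.describe())

# Skew is one quick signal of a strongly non-normal distribution
print(df.skew(numeric_only=True))

# Balance of classes in the target variable
print(df["target"].value_counts(normalize=True))
```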
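A data-cleaning sketch covering the audit steps in the outline; the sentinel codes and the `age` rule are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path

# Check data types and count missing values per column
print(df.dtypes)
print(df.isna().sum())

# Recode values that should be NaN/null (sentinel codes here are illustrative)
df = df.replace({-999: np.nan, "N/A": np.nan, "": np.nan})

# Remove "noise" observations, e.g. rows with an impossible value in a placeholder column
df = df[(df["age"] >= 0) & (df["age"] <= 120)]

# Separate numerical and categorical features for the preprocessing step
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```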
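A preprocessing sketch with scikit-learn: imputing and standardizing the numeric columns and creating dummy variables for the categorical ones. The column lists are placeholders standing in for the output of the cleaning sketch; handling an unbalanced target is shown later via `class_weight` in the classifiers.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]           # placeholder column names
categorical_cols = ["region", "channel"]   # placeholder column names

preprocess = ColumnTransformer([
    # Numeric: impute missing values with the median, then standardize
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # Categorical: expand into dummy variables; ignore levels unseen during fitting
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Fit on the training features (the DataFrame from the cleaning sketch) and transform:
# X_prepared = preprocess.fit_transform(df[numeric_cols + categorical_cols])
```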
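A plotting sketch with matplotlib for the scatter and bar plots named in the outline; the column names are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot: relationship between a feature and the target
ax1.scatter(df["income"], df["target"], alpha=0.5)
ax1.set_xlabel("income")
ax1.set_ylabel("target")

# Bar plot: counts per category
df["region"].value_counts().plot.bar(ax=ax2)
ax2.set_ylabel("count")

plt.tight_layout()
plt.show()
```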
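The feature-selection options named in the outline, sketched with scikit-learn; `k`, the percentile, and the alpha value are illustrative choices.

```python
from sklearn.feature_selection import RFE, SelectKBest, SelectPercentile, f_regression
from sklearn.linear_model import Lasso, LinearRegression

# Univariate filters: keep the k best features, or the top percentile, by an F-test score
k_best = SelectKBest(score_func=f_regression, k=10)
top_quarter = SelectPercentile(score_func=f_regression, percentile=25)

# Recursive feature elimination: repeatedly drop the weakest feature of a fitted model
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)

# Regularization as implicit selection: an L1 penalty shrinks weak coefficients to zero
lasso = Lasso(alpha=0.1)

# Each selector is fit on the prepared training data, e.g.:
# X_selected = k_best.fit_transform(X_train, y_train)
```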
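A sketch of the prediction (regression) models and metrics from the outline, evaluated with R^2, RMSE, and MedAE on synthetic data; MSLE is noted but only applies to non-negative targets. Hyperparameters are illustrative defaults, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, median_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for the prepared feature matrix and target
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "ridge": Ridge(alpha=1.0),
    "elastic_net": ElasticNet(alpha=1.0, l1_ratio=0.5),
    "tree (DTR)": DecisionTreeRegressor(max_depth=5, random_state=0),
    "neural net": MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0),
}

for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name,
          "R^2=%.3f" % r2_score(y_test, pred),
          "RMSE=%.2f" % np.sqrt(mean_squared_error(y_test, pred)),
          "MedAE=%.2f" % median_absolute_error(y_test, pred))

# mean_squared_log_error (MSLE) is only defined for non-negative targets and predictions
```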
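A sketch of the classifiers and ensemble methods from the outline, evaluated with accuracy, precision/recall/F1, and ROC AUC on a deliberately unbalanced synthetic target; `class_weight="balanced"` is shown as one simple way to deal with the imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately unbalanced data stands in for the real target variable
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "tree (DTC)": DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(probability=True, class_weight="balanced", random_state=0),
    "neural net (NN)": MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0),
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name,
          "acc=%.3f" % accuracy_score(y_test, pred),
          "precision=%.3f" % precision_score(y_test, pred),
          "recall=%.3f" % recall_score(y_test, pred),
          "F1=%.3f" % f1_score(y_test, pred),
          "AUC=%.3f" % roc_auc_score(y_test, proba))
```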
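Finally, a sketch of wrapping the steps in a scikit-learn Pipeline and covering it with a pytest-style unit test; the model choice and the specific assertion are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_pipeline():
    """Chain imputation, scaling, and a model so one object is fit, tuned, and shipped."""
    return Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=5)),
    ])


def test_pipeline_predicts_known_classes():
    """Unit test: a fitted pipeline should only emit labels seen in training."""
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    pipeline = build_pipeline().fit(X, y)
    assert set(pipeline.predict(X)) <= set(y)
```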