added lab6

dssg · May 13, 2019 · 78e4687
1 parent f9029c2
commit 78e4687
Showing 1 changed file with 262 additions and 0 deletions.
diff --git a/labs/2019/lab6_feature_generation.ipynb b/labs/2019/lab6_feature_generation.ipynb
@@ -0,0 +1,262 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Lab 6: Feature Generation\n",
+    "\n",
+    "\n",
+    "In this lab we'll get some hands-on experience with generating features.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Goals for this lab\n",
+    "\n",
+    "- Understand the types of features that are usually used in ML projects\n",
+    "- Practice generating those features\n",
+    "- Explore the effect of different types of features on model performance\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.metrics import accuracy_score as accuracy\n",
+    "from sklearn.metrics import roc_curve, auc\n",
+    "import graphviz # If you don't have this, install via pip/conda\n",
+    "from sklearn.metrics import f1_score, precision_recall_curve\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "%matplotlib inline\n",
+    "\n",
+    "# exercise: what additional modules should you import?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data\n",
+    "We'll use the DonorsChoose data that we used in Assignment 3."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Change this to wherever you're storing your data\n",
+    "datafile = \n",
+    "df = pd.read_csv(datafile)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Create label/outcome\n",
+    "Same as in the homework: predict whether a project on DonorsChoose will not be fully funded within 60 days of posting."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# code\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Feature Generation\n",
+    "We'll do this in a few iterations and run some models between iterations to see how the performance changes.\n",
+    " - Models: Let's run a few simple models: logistic regression (L2) and random forests (n_estimators = 1000)\n",
+    " - Training and Test Sets: For now, create one six-month test set and use the data before it as the training set (same as in the homework)\n",
+    " - Metrics: Try AUC-ROC and precision at 10% and 20%\n",
+    " \n",
+    "Feature Generation iterations:\n",
+    "\n",
+    "The main thing to remember here is that every feature is generated as of the \"posting_date\" and can only use information available up to that date.\n",
+    "\n",
+    "1. Select columns that already exist in the raw data and prepare them for use with sklearn models. This should be very similar to what you did in Assignment 3. You'll create dummy variables from categorical variables.\n",
+    "\n",
+    "2. Could discretizing some of the variables help? Try discretizing \"total_ammount\" and \"students_reached\".\n",
+    "\n",
+    "3. Aggregation:\n",
+    " - Let's try simple aggregations such as the number and percentage (two different features) of projects that got fully funded in the last x days, for several values of x (say 10, 30, and 60).\n",
+    " - You can extend the previous features to spatial aggregations by restricting them to the same city/state/school as the project you are generating features for.\n",
+    " - You can use the latitude and longitude to generate the same features for projects within some distance y.\n",
+    " \n"
+   ]
+  },
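+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As one possible reading of the simple aggregations above, here is a minimal sketch on toy data. The column names `date_posted` and `date_fully_funded` and the helper `funded_in_window` are assumptions for illustration, not names taken from the data file. For each project it looks only at projects posted in the previous x days, and it counts a project as funded only if its funding date falls before the current posting date, so nothing from the future is used.\n",
+    "\n",
+    "The loop is quadratic in the number of rows, which is fine for a toy example but may be slow on the full data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Toy data; both column names are assumptions, not the real schema.\n",
+    "toy = pd.DataFrame({\n",
+    "    'date_posted': pd.to_datetime(['2019-01-01', '2019-01-05', '2019-01-20', '2019-02-15']),\n",
+    "    'date_fully_funded': pd.to_datetime(['2019-01-10', None, '2019-01-25', None]),\n",
+    "})\n",
+    "\n",
+    "def funded_in_window(frame, days):\n",
+    "    # Hypothetical helper: count and share of recently posted projects\n",
+    "    # that were already fully funded when each project was posted.\n",
+    "    counts, shares = [], []\n",
+    "    for posted in frame['date_posted']:\n",
+    "        start = posted - pd.Timedelta(days=days)\n",
+    "        # Projects posted in the window [start, posted)\n",
+    "        recent = frame[(frame['date_posted'] >= start) & (frame['date_posted'] < posted)]\n",
+    "        # Only funding events known before this posting date count\n",
+    "        funded = recent['date_fully_funded'] < posted\n",
+    "        counts.append(int(funded.sum()))\n",
+    "        shares.append(funded.mean() if len(recent) else float('nan'))\n",
+    "    return counts, shares\n",
+    "\n",
+    "for x in [10, 30, 60]:\n",
+    "    toy['n_funded_%dd' % x], toy['pct_funded_%dd' % x] = funded_in_window(toy, x)\n",
+    "\n",
+    "toy"
+   ]
+  },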
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# feature generation code"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Let's test models now"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "raw_features = []\n",
+    "discretized_features = []\n",
+    "simple_aggregate_features = []\n",
+    "spatial_aggregate_features = []\n",
+    "\n",
+    "# now select which one(s) you want to test models with\n",
+    "selected_feature_groups = []\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Create one (temporal) train and test split \n"
+   ]
+  },
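+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of a single temporal split on toy data, assuming a hypothetical `date_posted` column and taking the test set to be the last six months of postings. The cutoff choice is an assumption; use whatever matches the homework. Because the label needs 60 days to be observed, you may also want to drop training rows posted less than 60 days before the cutoff."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Toy data; the column names are assumptions, not the real schema.\n",
+    "toy = pd.DataFrame({\n",
+    "    'date_posted': pd.to_datetime(['2018-03-01', '2018-09-15', '2018-11-30', '2018-12-20']),\n",
+    "    'label': [0, 1, 0, 1],\n",
+    "})\n",
+    "\n",
+    "# Assumed cutoff: six months before the latest posting date\n",
+    "test_end = toy['date_posted'].max()\n",
+    "test_start = test_end - pd.DateOffset(months=6)\n",
+    "\n",
+    "# Everything before the cutoff trains; everything from the cutoff on tests\n",
+    "toy_train = toy[toy['date_posted'] < test_start]\n",
+    "toy_test = toy[toy['date_posted'] >= test_start]\n",
+    "\n",
+    "len(toy_train), len(toy_test)"
+   ]
+  },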
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# code\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Imputation\n",
+    "Impute features that may be missing (separately on the train and test sets to avoid leakage).\n",
+    "Each feature may be missing for a different reason, so fill each one appropriately and generate missing flags as separate variables when necessary (remember what we discussed in class)."
+   ]
+  },
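+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal sketch of imputation with a missing flag, on toy data. The helper `impute_with_flag` is hypothetical, and median filling is only one option; the right fill value depends on why the feature is missing."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "def impute_with_flag(frame, column):\n",
+    "    # Hypothetical helper: adds <column>_missing and fills gaps with this frame's median.\n",
+    "    out = frame.copy()\n",
+    "    out[column + '_missing'] = out[column].isnull().astype(int)\n",
+    "    out[column] = out[column].fillna(out[column].median())\n",
+    "    return out\n",
+    "\n",
+    "# Toy train and test frames; the values are made up.\n",
+    "toy_train = pd.DataFrame({'students_reached': [10.0, np.nan, 30.0]})\n",
+    "toy_test = pd.DataFrame({'students_reached': [np.nan, 50.0]})\n",
+    "\n",
+    "# Each set is imputed on its own, following the note above.\n",
+    "toy_train = impute_with_flag(toy_train, 'students_reached')\n",
+    "toy_test = impute_with_flag(toy_test, 'students_reached')\n",
+    "\n",
+    "toy_train"
+   ]
+  },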
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# code"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Train and Test models\n",
+    "- Build model(s) using the selected feature groups\n",
+    "- Test the model(s)\n",
+    "- Evaluate\n",
+    "\n",
+    "You should do this for different subsets of the feature groups above to get an idea of the performance impact."
+   ]
+  },
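+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For the evaluation step, here is a minimal sketch of precision at the top k fraction of scores. The helper `precision_at_k` is hypothetical (it is not an sklearn function), and ties in the scores are broken arbitrarily by the sort."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "\n",
+    "def precision_at_k(y_true, y_scores, k):\n",
+    "    # Hypothetical helper: share of true positives among the top k fraction of scores.\n",
+    "    y_true = np.asarray(y_true)\n",
+    "    y_scores = np.asarray(y_scores)\n",
+    "    n_top = int(len(y_scores) * k)\n",
+    "    if n_top == 0:\n",
+    "        return float('nan')\n",
+    "    # Indices of the n_top highest scores\n",
+    "    top = np.argsort(y_scores)[::-1][:n_top]\n",
+    "    return y_true[top].mean()\n",
+    "\n",
+    "# Made-up labels and scores, only to show the call\n",
+    "toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]\n",
+    "toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]\n",
+    "\n",
+    "# The top 20% is the two highest scores, one of which is a positive\n",
+    "precision_at_k(toy_labels, toy_scores, 0.2)"
+   ]
+  },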
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "# code"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Add more features\n",
+    "Can you think of other features (especially aggregate ones) that will be helpful?\n",
+    "  - Average amount for fully funded projects in the last x days within distance y (or in the same geographical area)?\n",
+    "  - Difference between what this project is asking for and the feature above?\n",
+    "  - ...\n",
+    "  \n",
+    "Now create a new feature group and see how well the models do with the additional feature(s)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.14"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}