diff --git a/labs/2019/lab3_lr_svm_eval.ipynb b/labs/2019/lab3_lr_svm_eval.ipynb new file mode 100644 index 0000000..90decce --- /dev/null +++ b/labs/2019/lab3_lr_svm_eval.ipynb @@ -0,0 +1,600 @@ +{ + "cells": [
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Lab 3: Logistic Regression, Support Vector Machines, and Evaluation\n", + "\n", + "\n", + "In this lab we'll get some hands-on experience with two more classifiers we've seen in class:\n", + "- Logistic Regression\n", + "- Support Vector Machines\n", + "\n", + "We will also explore the evaluation metrics that we covered in class and understand how to calculate them." + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Goals for this lab\n", + "\n", + "- Understand the practical implications of changing the parameters used in Logistic Regression and Support Vector Machines\n", + " \n", + "- Learn more about the evaluation metrics covered in class and how to calculate them (at different thresholds)\n", + " - accuracy\n", + " - precision\n", + " - recall\n", + " - AUC" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "import numpy as np\n", + "import pandas as pd\n", + "import sklearn.tree as tree\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.neighbors import KNeighborsClassifier\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.metrics import accuracy_score as accuracy\n", + "import graphviz # If you don't have this, install via pip/conda\n", + "%matplotlib inline\n", + "\n", + "# exercise: what additional modules should you import?" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data\n", + "We'll continue to use the same data as in the previous lab.\n", + "\n", + "It is a subset of the data set from https://www.kaggle.com/new-york-state/nys-patient-characteristics-survey-pcs-2015\n", + "\n", + "The data has been downloaded, modified, and placed in the GitHub repo for the lab.\n", + "\n", + "You should also try this with the other data sets you have been provided for the homeworks." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# Change this to wherever you're storing your data\n", + "datafile = '../data/nysmedicaldata.csv'\n", + "df = pd.read_csv(datafile)" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df.head()" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "df.dtypes" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Some Quick Data Exploration\n", + "Before running any sort of model on your dataset, it's always a good idea to do some quick data exploration to get a sense of what your data looks like. Try to answer the following question with some sort of plot/histogram/etc.:\n", + "\n", + "1) What do the distributions of each feature look like?" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# Exercise: try this yourself first\n" + ] + },
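+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want a starting point, the next cell is one possible sketch (not the only way, and not necessarily the best way): it loops over the columns of df, plotting value counts for categorical columns and histograms for numeric ones. It relies only on the imports above; the bin count and figure size are arbitrary choices you should adjust." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# One possible sketch: a quick look at every column's distribution\n", + "for col in df.columns:\n", + "    plt.figure(figsize=(6, 3))\n", + "    if df[col].dtype == 'object':\n", + "        # categorical column: bar chart of the most common values\n", + "        df[col].value_counts().head(20).plot(kind='bar')\n", + "    else:\n", + "        # numeric column: histogram\n", + "        df[col].plot(kind='hist', bins=20)\n", + "    plt.title(col)\n", + "    plt.show()" + ] + },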
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Using scikit-learn for classification\n", + "\n", + "scikit-learn (sklearn) is a very useful Python package for building machine learning models. To build a model in sklearn, you need a matrix (or dataframe) of features X and a label column y. X is your set of features/predictors; y is a single column that is your label. We'll take the following steps:\n", + "\n", + "1. Select/create a column as the label/outcome (y)\n", + "2. Select/create columns as features (X)\n", + "3. Create Training Set\n", + "4. Create Validation Set\n", + "5. Build model on Training Set\n", + "6. Predict risk scores for the Validation Set\n", + "7. Calculate performance metric(s)\n", + "\n", + "## Some useful things to know in sklearn\n", + "\n", + "fit = train an algorithm\n", + "\n", + "predict_proba = predict a \"risk\" score for all possible classes for a given record (classification only)\n", + "\n", + "\n", + "## Important: never use .predict\n", + "There is also a function called \"predict\" which first runs predict_proba and then predicts a 1 if the score > 0.5 and 0 otherwise. *Never* use that function, since 0.5 is a completely arbitrary threshold for calling a prediction 1 vs 0.\n", + "\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Create label/outcome\n", + "One thing we can do with this dataset is to try to use the various feature columns to classify whether a person has High Blood Pressure. Let's create a column that is 1 if a person has High Blood Pressure and 0 otherwise." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# code" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question: what percentage of people have High Blood Pressure?" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# code" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Create or select existing predictors/features\n", + "\n", + "For now, let's take a handful of existing columns to use.\n", + "\n", + "sklearn needs features to be numeric rather than categorical, so we'll have to turn our selected categorical features into binary columns (also known as dummy variables)." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# code" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# code" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Train/Test Splits\n", + "\n", + "Create a train/test split using sklearn's [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. We'll use these train/test splits for evaluating all our classification models." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# code" + ] + },
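+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you get stuck, here is one possible sketch that strings steps 1-4 together. The column names ('High Blood Pressure', 'Age Group', 'Sex', 'Region Served') and the 'YES'/'NO' coding of the label are assumptions about this extract -- check df.columns and the unique values of your label column and substitute whatever you actually selected. The 20% test size and the random_state are arbitrary choices." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# Sketch only -- column names and label coding are assumptions; adjust to your data\n", + "label_col = 'High Blood Pressure'\n", + "feature_cols = ['Age Group', 'Sex', 'Region Served']  # replace with your chosen columns\n", + "\n", + "y = (df[label_col] == 'YES').astype(int)  # 1 if the person has high blood pressure, else 0\n", + "X = pd.get_dummies(df[feature_cols])      # turn categorical columns into dummy variables\n", + "\n", + "X_train, X_test, y_train, y_test = train_test_split(\n", + "    X, y, test_size=0.2, random_state=42)" + ] + },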
+ { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "# Logistic Regression\n", + "See the sklearn documentation on Logistic Regression for its parameters. The ones we'll mostly be interested in are:\n", + "- penalty\n", + "- C" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Remember that when training a model, **you should only use the training data!** The test set is reserved exclusively for evaluating your model. Now let's use the classifier:" + ] + },
+ { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# code\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Logistic Regression Tasks:\n", + "\n", + "The goal here is to explore different penalty parameters and different C values. You can also try modifying other parameters to see their impact. How does accuracy change, at different thresholds, as you vary the penalty and C values? You can write a nested for loop that loops over all the parameters and values and stores the results in a data frame (similar to the last lab)." + ] + },
+ { + "cell_type": "markdown", + "metadata": { + "collapsed": true + }, + "source": [ + "Ref: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html\n", + "\n", + "You'll notice that LogisticRegression takes a ton of parameters. We'll play around with the \"penalty\" and \"C\" parameters.\n", + "If we set the penalty parameter to ['l2'](http://mathworld.wolfram.com/L2-Norm.html), sklearn's LogisticRegression model solves the following minimization problem (with labels coded as $y_i \in \{-1, +1\}$):\n", + "\n", + "$$ \min_{\beta} \frac{1}{2} ||\beta||_2^2 + C \sum_{i} \log ( 1 + \exp( -y_i X_i^T \beta ))$$\n", + "\n", + "Similarly, if we set the penalty parameter to ['l1'](http://mathworld.wolfram.com/L1-Norm.html), LogisticRegression will solve the following minimization problem:\n", + "\n", + "$$\min_{\beta} ||\beta||_1 + C \sum_{i} \log ( 1 + \exp( -y_i X_i^T \beta ))$$\n", + "\n", + "where $$||\beta||_2 = \sqrt { \sum_{i} \beta_i^2 }$$ and $$||\beta||_1 = \sum_{i} | \beta_i | $$ \n", + "\n", + "Try running logistic regression with both L1 and L2 penalties and a mix of C values. Something like $10^{-2}, 10^{-1}, 1, 10, 10^2$ is reasonable." + ] + },
+ { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# code" + ] + },
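+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below is one possible sketch of that nested loop (yours may look different). It assumes the X_train/X_test/y_train/y_test split created earlier. The threshold of 0.3 is an arbitrary illustration, not the right answer, and solver='liblinear' is just one solver that happens to support both the L1 and L2 penalties." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from sklearn.linear_model import LogisticRegression\n", + "\n", + "# Sketch: loop over penalty and C, score the test set, compute accuracy at one threshold\n", + "results = []\n", + "threshold = 0.3  # arbitrary -- try looping over several thresholds as well\n", + "for penalty in ['l1', 'l2']:\n", + "    for C in [0.01, 0.1, 1, 10, 100]:\n", + "        lr = LogisticRegression(penalty=penalty, C=C, solver='liblinear')\n", + "        lr.fit(X_train, y_train)\n", + "        scores = lr.predict_proba(X_test)[:, 1]   # score for the positive class\n", + "        preds = (scores > threshold).astype(int)\n", + "        acc = accuracy(y_test, preds)             # accuracy_score, imported earlier\n", + "        results.append({'penalty': penalty, 'C': C, 'threshold': threshold, 'accuracy': acc})\n", + "\n", + "results_df = pd.DataFrame(results)\n", + "results_df" + ] + },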
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Understanding what's going on inside Logistic Regression" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To really see the difference between L1 and L2 regularization, we need to take a closer look at the models they produce. Plot a histogram of the weight values of the LogisticRegression models for each C value. You can access these weight coefficients via the coef\_ attribute of LogisticRegression. Do you notice anything interesting happening as the C value varies?" + ] + },
+ { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# code" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Support Vector Machines" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ref: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC\n", + "The SVM classifier also takes quite a few parameters. For now we will use linear SVMs; the model is called LinearSVC in sklearn.\n", + "\n", + "We will be playing with the following parameters:\n", + "* C: same as above\n", + "\n", + "An SVM tries to find the hyperplane that maximizes the \"margin\" between the two classes of points. The \"C\" parameter in LinearSVC (and in the kernelized [SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)) has the same role as the \"C\" parameter in LogisticRegression: it tells you how much to penalize the \"size\" of the weight vector. Note that the kernelized SVC only allows L2 regularization; LinearSVC defaults to L2 as well.\n", + "\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Let's fit an SVM" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Now predict scores on the test set and plot the distribution of scores\n", + "You might notice that the function you've been using to predict so far does not work. Is there another function you need to use? Which one? Why?" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Now we can select a threshold and calculate accuracy" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Let's now vary the values of C and see the results." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + },
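+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you are stuck on the questions above: LinearSVC has no predict_proba, but it does have decision_function, which returns a signed distance from the separating hyperplane that you can threshold yourself. The cell below is one possible sketch covering the last few steps; the threshold of 0, the C grid, and the reuse of the earlier train/test split are all assumptions you should revisit." + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "from sklearn.svm import LinearSVC\n", + "\n", + "# Sketch: LinearSVC has no predict_proba, so use decision_function and threshold it yourself\n", + "svm_threshold = 0  # arbitrary: 0 is the decision boundary, but try other values\n", + "for C in [0.01, 0.1, 1, 10, 100]:\n", + "    svm = LinearSVC(C=C)\n", + "    svm.fit(X_train, y_train)\n", + "    svm_scores = svm.decision_function(X_test)  # signed distance from the hyperplane\n", + "    svm_preds = (svm_scores > svm_threshold).astype(int)\n", + "    print(C, accuracy(y_test, svm_preds))\n", + "\n", + "# distribution of scores for the last model fit above\n", + "plt.hist(svm_scores, bins=30)\n", + "plt.xlabel('decision_function score')\n", + "plt.show()" + ] + },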
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evaluation Metrics" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We covered several evaluation metrics in class:\n", + " - accuracy\n", + " - precision\n", + " - recall\n", + " - area under curve\n", + " - ROC curves\n", + " \n", + "Although sklearn has built-in functions to calculate these metrics,\n", + "in this lab we want to give you an understanding of these metrics\n", + "by writing functions to calculate them yourself.\n", + "\n", + "Remember that accuracy, precision, and recall are calculated at a specific threshold for turning scores into 0s and 1s.\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set Threshold\n" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "threshold = " + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### We will first create a confusion matrix based on this threshold" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "true_positives =\n", + "false_positives =\n", + "true_negatives =\n", + "false_negatives = " + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Let's now write functions that can calculate each metric" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def calculate_accuracy_at_threshold(predicted_scores, true_labels, threshold):\n", + "\n", + "\n", + "def calculate_precision_at_threshold(predicted_scores, true_labels, threshold):\n", + "\n", + "\n", + "def calculate_recall_at_threshold(predicted_scores, true_labels, threshold):\n" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Now let's calculate all of these for a logistic regression model you built above" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Now let's write a function that generates the precision, recall, k (% of population) graph that we covered in class\n", + "\n" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "def plot_precision_recall_k(predicted_scores, true_labels):" + ] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Let's plot it for the same logistic regression model" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + },
+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Now build the same graph for an SVM model and compare the two. Which one is better?" + ] + },
+ { + "cell_type": "code", + "execution_count": null, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [] + }
+ ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.14" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}
diff --git a/labs/imgs/prk.png b/labs/imgs/prk.png new file mode 100644 index 0000000..e346fbe Binary files /dev/null and b/labs/imgs/prk.png differ