diff --git a/README.md b/README.md
index e048a6a..77738e1 100644
--- a/README.md
+++ b/README.md
@@ -118,7 +118,7 @@ df.head()
 
 ## Create training and test sets
 
-Before we do anything we'll want to split our data into **_training_** and **_test_** sets. We'll accomplish this by first splitting the DataFrame into features (`X`) and target (`y`), then passing `X` and `y` to the `train_test_split()` function to create a 70/30 train test split.
+Before we do anything we'll want to split our data into **_training_** and **_test_** sets. We'll accomplish this by first splitting the DataFrame into features (`X`) and target (`y`), then passing `X` and `y` to the `train_test_split()` function to split the data so that 70% of it is in the training set, and 30% of it is in the testing set.
 
 ```python
@@ -255,7 +255,7 @@ ohe_df.head()
 
 One awesome feature of scikit-learn is the uniformity of its interfaces for every classifier -- no matter what classifier we're using, we can expect it to have the same important methods such as `.fit()` and `.predict()`. This means that this next part should feel familiar.
 
-We'll first create an instance of the classifier with any parameter values, and then we'll fit our data to the model using `.fit()`.
+We'll first create an instance of the classifier with any parameter values we have, and then we'll fit our data to the model using `.fit()`.
 
 ```python
@@ -307,7 +307,7 @@ Image(graph.create_png())
 
 ## Evaluate the predictive performance
 
-Now that we have a trained model, we can generate some predictions, and go on to see how accurate our predictions are. We can use a simple accuracy measure, AUC, a confusion matrix, or all of them. This step is performed in exactly the same manner, doesn't matter which classifier you are dealing with.
+Now that we have a trained model, we can generate some predictions, and go on to see how accurate our predictions are. We can use a simple accuracy measure, AUC, a confusion matrix, or all of them. This step is performed in exactly the same manner, so it doesn't matter which classifier you are dealing with.
 
 ```python
diff --git a/index.ipynb b/index.ipynb
index 16b812e..ef8f71b 100644
--- a/index.ipynb
+++ b/index.ipynb
@@ -1 +1 @@
-{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Building Trees using scikit-learn\n", "\n", "## Introduction\n", "\n", "In this lesson, we will cover decision trees (for classification) in Python, using scikit-learn and pandas. The emphasis will be on the basics and understanding the resulting decision tree. Scikit-learn provides a consistent interface for running different classifiers/regressors. For classification tasks, evaluation is performed using the same measures as we have seen before. Let's look at our example from earlier lessons and grow a tree to find our solution. \n", "\n", "## Objectives \n", "\n", "You will be able to:\n", "\n", "- Use scikit-learn to fit a decision tree classification model \n", "- Plot a decision tree using Python \n", "\n", "\n", "## Import necessary modules and data\n", "\n", "In order to prepare data, train, evaluate, and visualize a decision tree, we will make use of several modules in the scikit-learn package. Run the cell below to import everything we'll need for this lesson: "]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["import numpy as np \n", "import pandas as pd \n", "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeClassifier \n", "from sklearn.metrics import accuracy_score\n", "from sklearn.tree import export_graphviz\n", "from sklearn.preprocessing import OneHotEncoder\n", "from IPython.display import Image \n", "from sklearn.tree import export_graphviz\n", "from pydotplus import graph_from_dot_data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The play tennis dataset is available in the repo as `'tennis.csv'`. 
For this step, we'll start by importing the csv file as a pandas DataFrame."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
    outlook  temp  humidity  windy  play
0     sunny   hot      high  False    no
1     sunny   hot      high   True    no
2  overcast   hot      high  False   yes
3     rainy  mild      high  False   yes
4     rainy  cool    normal  False   yes
"], "text/plain": [" outlook temp humidity windy play\n", "0 sunny hot high False no\n", "1 sunny hot high True no\n", "2 overcast hot high False yes\n", "3 rainy mild high False yes\n", "4 rainy cool normal False yes"]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["# Load the dataset\n", "df = pd.read_csv('tennis.csv')\n", "\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Create training and test sets\n", "\n", "Before we do anything we'll want to split our data into **_training_** and **_test_** sets. We'll accomplish this by first splitting the DataFrame into features (`X`) and target (`y`), then passing `X` and `y` to the `train_test_split()` function to create a 70/30 train test split."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["X = df.loc[:, ['outlook', 'temp', 'humidity', 'windy']]\n", "y = df.loc[:, 'play']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Encode categorical data as numbers\n", "\n", "Since all of our data is currently categorical (recall that each column is in string format), we need to encode them as numbers. For this, we'll use a handy helper object from sklearn's `preprocessing` module called `OneHotEncoder`."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["
   outlook_overcast  outlook_rainy  outlook_sunny  temp_cool  temp_hot  temp_mild  humidity_high  humidity_normal  windy_False  windy_True
0               0.0            0.0            1.0        1.0       0.0        0.0            0.0              1.0          1.0         0.0
1               1.0            0.0            0.0        0.0       1.0        0.0            1.0              0.0          1.0         0.0
2               0.0            0.0            1.0        0.0       1.0        0.0            1.0              0.0          0.0         1.0
3               0.0            1.0            0.0        0.0       0.0        1.0            1.0              0.0          0.0         1.0
4               0.0            1.0            0.0        1.0       0.0        0.0            0.0              1.0          1.0         0.0
"], "text/plain": [" outlook_overcast outlook_rainy outlook_sunny temp_cool temp_hot \\\n", "0 0.0 0.0 1.0 1.0 0.0 \n", "1 1.0 0.0 0.0 0.0 1.0 \n", "2 0.0 0.0 1.0 0.0 1.0 \n", "3 0.0 1.0 0.0 0.0 0.0 \n", "4 0.0 1.0 0.0 1.0 0.0 \n", "\n", " temp_mild humidity_high humidity_normal windy_False windy_True \n", "0 0.0 0.0 1.0 1.0 0.0 \n", "1 0.0 1.0 0.0 1.0 0.0 \n", "2 0.0 1.0 0.0 0.0 1.0 \n", "3 1.0 1.0 0.0 0.0 1.0 \n", "4 0.0 0.0 1.0 1.0 0.0 "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["# One-hot encode the training data and show the resulting DataFrame with proper column names\n", "ohe = OneHotEncoder()\n", "\n", "ohe.fit(X_train)\n", "X_train_ohe = ohe.transform(X_train).toarray()\n", "\n", "# Creating this DataFrame is not necessary; it's only to show the result of the one-hot encoding\n", "ohe_df = pd.DataFrame(X_train_ohe, columns=ohe.get_feature_names(X_train.columns))\n", "\n", "ohe_df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Train the decision tree \n", "\n", "One awesome feature of scikit-learn is the uniformity of its interfaces for every classifier -- no matter what classifier we're using, we can expect it to have the same important methods such as `.fit()` and `.predict()`. This means that this next part should feel familiar.\n", "\n", "We'll first create an instance of the classifier with any parameter values, and then we'll fit our data to the model using `.fit()`. 
"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False,\n", " random_state=None, splitter='best')"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["# Create the classifier and fit it on the training data\n", "clf = DecisionTreeClassifier(criterion='entropy')\n", "\n", "clf.fit(X_train_ohe, y_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Plot the decision tree \n", "\n", "You can see what rules the tree learned by plotting this decision tree. To do this, you need to use additional packages such as `pydotplus`. \n", "\n", "> **Note:** If you run into errors while generating the plot, you probably need to install `python-graphviz` on your machine using `conda install python-graphviz`. "]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["# Create DOT data\n", "dot_data = export_graphviz(clf, out_file=None, \n", " feature_names=ohe_df.columns, \n", " class_names=np.unique(y).astype('str'), \n", " filled=True, rounded=True, special_characters=True)\n", "\n", "# Draw graph\n", "graph = graph_from_dot_data(dot_data) \n", "\n", "# Show graph\n", "Image(graph.create_png())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Evaluate the predictive performance\n", "\n", "Now that we have a trained model, we can generate some predictions, and go on to see how accurate our predictions are. We can use a simple accuracy measure, AUC, a confusion matrix, or all of them. 
This step is performed in exactly the same manner, doesn't matter which classifier you are dealing with. "]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Accuracy: 0.6\n"]}], "source": ["X_test_ohe = ohe.transform(X_test)\n", "y_preds = clf.predict(X_test_ohe)\n", "\n", "print('Accuracy: ', accuracy_score(y_test, y_preds))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Summary \n", "\n", "In this lesson, we looked at how to grow a decision tree using `scikit-learn`. We looked at different stages of data processing, training, and evaluation that you would normally come across while growing a tree or training any other such classifier. We shall now move to a lab, where you will be required to build a tree for a given problem, following the steps shown in this lesson. "]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3"}}, "nbformat": 4, "nbformat_minor": 2}
\ No newline at end of file
+{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Building Trees using scikit-learn\n", "\n", "## Introduction\n", "\n", "In this lesson, we will cover decision trees (for classification) in Python, using scikit-learn and pandas. The emphasis will be on the basics and understanding the resulting decision tree. Scikit-learn provides a consistent interface for running different classifiers/regressors. For classification tasks, evaluation is performed using the same measures as we have seen before. Let's look at our example from earlier lessons and grow a tree to find our solution. 
\n", "\n", "## Objectives \n", "\n", "You will be able to:\n", "\n", "- Use scikit-learn to fit a decision tree classification model \n", "- Plot a decision tree using Python \n", "\n", "\n", "## Import necessary modules and data\n", "\n", "In order to prepare data, train, evaluate, and visualize a decision tree, we will make use of several modules in the scikit-learn package. Run the cell below to import everything we'll need for this lesson: "]}, {"cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": ["import numpy as np \n", "import pandas as pd \n", "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeClassifier \n", "from sklearn.metrics import accuracy_score\n", "from sklearn.tree import export_graphviz\n", "from sklearn.preprocessing import OneHotEncoder\n", "from IPython.display import Image \n", "from sklearn.tree import export_graphviz\n", "from pydotplus import graph_from_dot_data"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The play tennis dataset is available in the repo as `'tennis.csv'`. For this step, we'll start by importing the csv file as a pandas DataFrame."]}, {"cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [{"data": {"text/html": ["
    outlook  temp  humidity  windy  play
0     sunny   hot      high  False    no
1     sunny   hot      high   True    no
2  overcast   hot      high  False   yes
3     rainy  mild      high  False   yes
4     rainy  cool    normal  False   yes
"], "text/plain": [" outlook temp humidity windy play\n", "0 sunny hot high False no\n", "1 sunny hot high True no\n", "2 overcast hot high False yes\n", "3 rainy mild high False yes\n", "4 rainy cool normal False yes"]}, "execution_count": 2, "metadata": {}, "output_type": "execute_result"}], "source": ["# Load the dataset\n", "df = pd.read_csv('tennis.csv')\n", "\n", "df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Create training and test sets\n", "\n", "Before we do anything we'll want to split our data into **_training_** and **_test_** sets. We'll accomplish this by first splitting the DataFrame into features (`X`) and target (`y`), then passing `X` and `y` to the `train_test_split()` function to split the data so that 70% of it is in the training set, and 30% of it is in the testing set."]}, {"cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": ["X = df.loc[:, ['outlook', 'temp', 'humidity', 'windy']]\n", "y = df.loc[:, 'play']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Encode categorical data as numbers\n", "\n", "Since all of our data is currently categorical (recall that each column is in string format), we need to encode them as numbers. For this, we'll use a handy helper object from sklearn's `preprocessing` module called `OneHotEncoder`."]}, {"cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [{"data": {"text/html": ["
   outlook_overcast  outlook_rainy  outlook_sunny  temp_cool  temp_hot  temp_mild  humidity_high  humidity_normal  windy_False  windy_True
0               0.0            0.0            1.0        1.0       0.0        0.0            0.0              1.0          1.0         0.0
1               1.0            0.0            0.0        0.0       1.0        0.0            1.0              0.0          1.0         0.0
2               0.0            0.0            1.0        0.0       1.0        0.0            1.0              0.0          0.0         1.0
3               0.0            1.0            0.0        0.0       0.0        1.0            1.0              0.0          0.0         1.0
4               0.0            1.0            0.0        1.0       0.0        0.0            0.0              1.0          1.0         0.0
"], "text/plain": [" outlook_overcast outlook_rainy outlook_sunny temp_cool temp_hot \\\n", "0 0.0 0.0 1.0 1.0 0.0 \n", "1 1.0 0.0 0.0 0.0 1.0 \n", "2 0.0 0.0 1.0 0.0 1.0 \n", "3 0.0 1.0 0.0 0.0 0.0 \n", "4 0.0 1.0 0.0 1.0 0.0 \n", "\n", " temp_mild humidity_high humidity_normal windy_False windy_True \n", "0 0.0 0.0 1.0 1.0 0.0 \n", "1 0.0 1.0 0.0 1.0 0.0 \n", "2 0.0 1.0 0.0 0.0 1.0 \n", "3 1.0 1.0 0.0 0.0 1.0 \n", "4 0.0 0.0 1.0 1.0 0.0 "]}, "execution_count": 4, "metadata": {}, "output_type": "execute_result"}], "source": ["# One-hot encode the training data and show the resulting DataFrame with proper column names\n", "ohe = OneHotEncoder()\n", "\n", "ohe.fit(X_train)\n", "X_train_ohe = ohe.transform(X_train).toarray()\n", "\n", "# Creating this DataFrame is not necessary; it's only to show the result of the one-hot encoding\n", "ohe_df = pd.DataFrame(X_train_ohe, columns=ohe.get_feature_names(X_train.columns))\n", "\n", "ohe_df.head()"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Train the decision tree \n", "\n", "One awesome feature of scikit-learn is the uniformity of its interfaces for every classifier -- no matter what classifier we're using, we can expect it to have the same important methods such as `.fit()` and `.predict()`. This means that this next part should feel familiar.\n", "\n", "We'll first create an instance of the classifier with any parameter values we have, and then we'll fit our data to the model using `.fit()`. 
"]}, {"cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [{"data": {"text/plain": ["DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False,\n", " random_state=None, splitter='best')"]}, "execution_count": 5, "metadata": {}, "output_type": "execute_result"}], "source": ["# Create the classifier and fit it on the training data\n", "clf = DecisionTreeClassifier(criterion='entropy')\n", "\n", "clf.fit(X_train_ohe, y_train)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Plot the decision tree \n", "\n", "You can see what rules the tree learned by plotting this decision tree. To do this, you need to use additional packages such as `pydotplus`. \n", "\n", "> **Note:** If you run into errors while generating the plot, you probably need to install `python-graphviz` on your machine using `conda install python-graphviz`. "]}, {"cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [{"data": {"image/png": "\n", "text/plain": [""]}, "execution_count": 6, "metadata": {}, "output_type": "execute_result"}], "source": ["# Create DOT data\n", "dot_data = export_graphviz(clf, out_file=None, \n", " feature_names=ohe_df.columns, \n", " class_names=np.unique(y).astype('str'), \n", " filled=True, rounded=True, special_characters=True)\n", "\n", "# Draw graph\n", "graph = graph_from_dot_data(dot_data) \n", "\n", "# Show graph\n", "Image(graph.create_png())"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Evaluate the predictive performance\n", "\n", "Now that we have a trained model, we can generate some predictions, and go on to see how accurate our predictions are. We can use a simple accuracy measure, AUC, a confusion matrix, or all of them. 
This step is performed in exactly the same manner, so it doesn't matter which classifier you are dealing with. "]}, {"cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [{"name": "stdout", "output_type": "stream", "text": ["Accuracy: 0.6\n"]}], "source": ["X_test_ohe = ohe.transform(X_test)\n", "y_preds = clf.predict(X_test_ohe)\n", "\n", "print('Accuracy: ', accuracy_score(y_test, y_preds))"]}, {"cell_type": "markdown", "metadata": {}, "source": ["## Summary \n", "\n", "In this lesson, we looked at how to grow a decision tree using `scikit-learn`. We looked at different stages of data processing, training, and evaluation that you would normally come across while growing a tree or training any other such classifier. We shall now move to a lab, where you will be required to build a tree for a given problem, following the steps shown in this lesson. "]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4"}}, "nbformat": 4, "nbformat_minor": 2}
\ No newline at end of file
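
The evaluation cell in the notebook reports only accuracy, although the lesson text also mentions a confusion matrix. As a rough illustration of what those two measures count, here is a minimal plain-Python sketch. The label lists are hypothetical and are not the lesson's actual `y_test` and `y_preds`. The helper names `accuracy` and `confusion_counts` are made up for this example. In the lesson itself you would more likely call `accuracy_score` and `confusion_matrix` from `sklearn.metrics`.

```python
from collections import Counter

# Hypothetical labels for illustration only; these are NOT the lesson's
# actual y_test / y_preds values.
y_test = ['yes', 'no', 'yes', 'yes', 'no']
y_preds = ['yes', 'yes', 'yes', 'no', 'no']


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred, labels):
    """Nested list of counts: rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


labels = ['no', 'yes']
acc = accuracy(y_test, y_preds)
matrix = confusion_counts(y_test, y_preds, labels)

print('Accuracy:', acc)
print('Confusion matrix (rows = true, columns = predicted):')
for label, row in zip(labels, matrix):
    print(label, row)
```

The row/column layout follows the convention scikit-learn uses (true labels in rows, predicted labels in columns), so `confusion_matrix(y_test, y_preds, labels=['no', 'yes'])` should give the same counts as an array.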