ENH Add Adult census dataset description (#747)
ArturoAmorQ authored Nov 3, 2023
1 parent 3aa7191 commit 4851394
Showing 6 changed files with 366 additions and 14 deletions.
2 changes: 1 addition & 1 deletion jupyter-book/_toc.yml
@@ -213,7 +213,7 @@ parts:
- file: appendix/datasets_intro
sections:
- file: python_scripts/trees_dataset
- file: appendix/adult_census_description
- file: python_scripts/datasets_adult_census
- file: python_scripts/datasets_california_housing
- file: python_scripts/datasets_ames_housing
- file: python_scripts/datasets_blood_transfusion
11 changes: 0 additions & 11 deletions jupyter-book/appendix/adult_census_description.md

This file was deleted.

203 changes: 203 additions & 0 deletions notebooks/datasets_adult_census.ipynb
@@ -0,0 +1,203 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Adult census dataset\n",
"\n",
"[This dataset](http://www.openml.org/d/1590) is a collection of demographic\n",
"information on the adult population of the USA as of 1994. The task is to\n",
"predict whether a person earns a high or a low revenue in USD/year.\n",
"\n",
"The column named **class** is the target variable (i.e., the variable we want\n",
"to predict). The two possible classes are `\" <=50K\"` (low-revenue) and\n",
"`\" >50K\"` (high-revenue).\n",
"\n",
"Before drawing any conclusions from its statistics or from the predictions of\n",
"models trained on it, remember that this dataset is not only outdated, but\n",
"also not representative of the US population. In fact, the original data\n",
"contains a feature named `fnlwgt` that encodes the number of units in the\n",
"target population that the responding unit represents, i.e. the sampling\n",
"weight that would be needed to make the sample representative.\n",
"\n",
"First we load the dataset. We keep only some columns of interest to ease the\n",
"plotting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
"columns_to_plot = [\n",
" \"age\",\n",
" \"education-num\",\n",
" \"capital-loss\",\n",
" \"capital-gain\",\n",
" \"hours-per-week\",\n",
" \"relationship\",\n",
" \"class\",\n",
"]\n",
"target_name = \"class\"\n",
"target = adult_census[target_name]"
]
},
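{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick, purely illustrative check (not part of the original analysis), we\n",
"can display the relative frequency of each class; note that the class labels\n",
"indeed contain a leading whitespace:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Relative frequency of each target class\n",
"target.value_counts(normalize=True)"
]
},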
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We explore this dataset in the first module's notebook \"First look at our\n",
"dataset\", where we provide a first intuition on how the data is structured.\n",
"There, we use a seaborn pairplot to visualize pairwise relationships between\n",
"the numerical variables in the dataset: it aligns scatter plots for every pair\n",
"of variables and histograms on the diagonal of the array.\n",
"\n",
"This approach is limited:\n",
"- Pair plots can only deal with numerical features; and\n",
"- by observing only pairwise interactions we end up with a two-dimensional\n",
"  projection of a multi-dimensional feature space, which can lead to a wrong\n",
"  interpretation of the individual impact of a feature.\n",
"\n",
"Below, after an illustrative pairplot sketch, we explore the relation between\n",
"features in more detail using plotly's `Parcoords`."
]
},
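{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the pairplot mentioned above (purely illustrative, and\n",
"assuming seaborn is available in the environment) could look as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"\n",
"# Scatter plots for every pair of numerical columns, histograms on the\n",
"# diagonal, colored by the target class\n",
"_ = sns.pairplot(\n",
"    adult_census[columns_to_plot].drop(columns=[\"relationship\"]),\n",
"    hue=\"class\",\n",
"    diag_kind=\"hist\",\n",
")"
]
},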
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import plotly.graph_objects as go\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"le = LabelEncoder()\n",
"\n",
"\n",
"def generate_dict(col):\n",
" \"\"\"Check if column is categorical and generate the appropriate dict\"\"\"\n",
" if adult_census[col].dtype == \"object\": # Categorical column\n",
" encoded = le.fit_transform(adult_census[col])\n",
" return {\n",
" \"tickvals\": list(range(len(le.classes_))),\n",
" \"ticktext\": list(le.classes_),\n",
" \"label\": col,\n",
" \"values\": encoded,\n",
" }\n",
" else: # Numerical column\n",
" return {\"label\": col, \"values\": adult_census[col]}\n",
"\n",
"\n",
"plot_list = [generate_dict(col) for col in columns_to_plot]\n",
"\n",
"fig = go.Figure(\n",
" data=go.Parcoords(\n",
" line=dict(\n",
" color=le.fit_transform(target),\n",
" colorscale=\"Viridis\",\n",
" ),\n",
" dimensions=plot_list,\n",
" )\n",
")\n",
"fig.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Parcoords` plot is quite similar to the parallel coordinates plot that we\n",
"present in the module on hyperparameter tuning in this MOOC. It displays the\n",
"values of the features on different columns while the target class is color\n",
"coded. Thus, we can quickly inspect whether there is a range of values for a\n",
"certain feature that leads to a particular outcome.\n",
"\n",
"As in the parallel coordinates plot, it is possible to select one or more\n",
"ranges of values by clicking and holding on any axis of the plot. You can then\n",
"slide (move) the range selection and cross two selections to see the\n",
"intersections. You can undo a selection by clicking once again on the same\n",
"axis.\n",
"\n",
"In particular, for this dataset we observe that `\"age\"` values below 20 years\n",
"are quite predictive of low income, regardless of the values of the other\n",
"features. Similarly, a `\"capital-loss\"` above `4000` seems to lead to low\n",
"income. We take a quick look at the former observation right after this cell.\n",
"\n",
"Even though it is beyond the scope of the present MOOC, one can additionally\n",
"explore correlations between features, for example using Spearman's rank\n",
"correlation, as the more popular Pearson's correlation is only appropriate for\n",
"continuous data that is normally distributed and linearly related. Spearman's\n",
"correlation is more versatile in dealing with non-linear relationships and\n",
"ordinal data, but it is not meant for nominal categorical data."
]
},
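{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before exploring such correlations, here is a rough, purely illustrative\n",
"sanity check (not part of the original analysis) of the observation on\n",
"`\"age\"` made above: the class distribution among respondents younger than 20."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Proportion of each class among respondents younger than 20\n",
"young = adult_census[adult_census[\"age\"] < 20]\n",
"young[\"class\"].value_counts(normalize=True)"
]
},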
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"from scipy.cluster import hierarchy\n",
"from scipy.spatial.distance import squareform\n",
"from scipy.stats import spearmanr\n",
"\n",
"# Keep numerical features only\n",
"X = adult_census[columns_to_plot].drop(columns=[\"class\", \"relationship\"])\n",
"fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))\n",
"corr = spearmanr(X).correlation\n",
"\n",
"# Ensure the correlation matrix is symmetric\n",
"corr = (corr + corr.T) / 2\n",
"np.fill_diagonal(corr, 1)\n",
"\n",
"# We convert the correlation matrix to a distance matrix before performing\n",
"# hierarchical clustering using Ward's linkage.\n",
"distance_matrix = 1 - np.abs(corr)\n",
"dist_linkage = hierarchy.ward(squareform(distance_matrix))\n",
"dendro = hierarchy.dendrogram(\n",
" dist_linkage, labels=X.columns.to_list(), ax=ax1, leaf_rotation=90\n",
")\n",
"dendro_idx = np.arange(0, len(dendro[\"ivl\"]))\n",
"\n",
"ax2.imshow(corr[dendro[\"leaves\"], :][:, dendro[\"leaves\"]], cmap=\"coolwarm\")\n",
"ax2.set_xticks(dendro_idx)\n",
"ax2.set_yticks(dendro_idx)\n",
"ax2.set_xticklabels(dendro[\"ivl\"], rotation=\"vertical\")\n",
"ax2.set_yticklabels(dendro[\"ivl\"])\n",
"_ = fig.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using a [diverging\n",
"colormap](https://matplotlib.org/stable/users/explain/colors/colormaps.html#diverging)\n",
"such as \"coolwarm\", the paler the color, the weaker the (anti)correlation\n",
"between features (no correlation is mapped to white). Here, dark blue\n",
"represents a strong negative correlation and dark red a strong positive\n",
"correlation. Indeed, any feature is perfectly correlated with itself, which is\n",
"why the diagonal of the matrix is dark red."
]
}
],
"metadata": {
"jupytext": {
"main_language": "python"
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
160 changes: 160 additions & 0 deletions python_scripts/datasets_adult_census.py
@@ -0,0 +1,160 @@
# ---
# jupyter:
# kernelspec:
# display_name: Python 3
# name: python3
# ---

# %% [markdown]
# # The Adult census dataset
#
# [This dataset](http://www.openml.org/d/1590) is a collection of demographic
# information on the adult population of the USA as of 1994. The task is to
# predict whether a person earns a high or a low revenue in USD/year.
#
# The column named **class** is the target variable (i.e., the variable we want
# to predict). The two possible classes are `" <=50K"` (low-revenue) and
# `" >50K"` (high-revenue).
#
# Before drawing any conclusions from its statistics or from the predictions of
# models trained on it, remember that this dataset is not only outdated, but
# also not representative of the US population. In fact, the original data
# contains a feature named `fnlwgt` that encodes the number of units in the
# target population that the responding unit represents, i.e. the sampling
# weight that would be needed to make the sample representative.
#
# First we load the dataset. We keep only some columns of interest to ease the
# plotting.

# %%
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
columns_to_plot = [
"age",
"education-num",
"capital-loss",
"capital-gain",
"hours-per-week",
"relationship",
"class",
]
target_name = "class"
target = adult_census[target_name]
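
# %% [markdown]
# As a quick, purely illustrative check (not part of the original analysis), we
# can display the relative frequency of each class; note that the class labels
# indeed contain a leading whitespace:

# %%
# Relative frequency of each target class
target.value_counts(normalize=True)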

# %% [markdown]
# We explore this dataset in the first module's notebook "First look at our
# dataset", where we provide a first intuition on how the data is structured.
# There, we use a seaborn pairplot to visualize pairwise relationships between
# the numerical variables in the dataset: it aligns scatter plots for every pair
# of variables and histograms on the diagonal of the array.
#
# This approach is limited:
# - Pair plots can only deal with numerical features; and
# - by observing only pairwise interactions we end up with a two-dimensional
#   projection of a multi-dimensional feature space, which can lead to a wrong
#   interpretation of the individual impact of a feature.
#
# Below, after an illustrative pairplot sketch, we explore the relation between
# features in more detail using plotly's `Parcoords`.
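
# %% [markdown]
# A minimal sketch of the pairplot mentioned above (purely illustrative, and
# assuming seaborn is available in the environment) could look as follows:

# %%
import seaborn as sns

# Scatter plots for every pair of numerical columns, histograms on the
# diagonal, colored by the target class
_ = sns.pairplot(
    adult_census[columns_to_plot].drop(columns=["relationship"]),
    hue="class",
    diag_kind="hist",
)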

# %%
import plotly.graph_objects as go
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()


def generate_dict(col):
"""Check if column is categorical and generate the appropriate dict"""
if adult_census[col].dtype == "object": # Categorical column
encoded = le.fit_transform(adult_census[col])
return {
"tickvals": list(range(len(le.classes_))),
"ticktext": list(le.classes_),
"label": col,
"values": encoded,
}
else: # Numerical column
return {"label": col, "values": adult_census[col]}


plot_list = [generate_dict(col) for col in columns_to_plot]

fig = go.Figure(
data=go.Parcoords(
line=dict(
color=le.fit_transform(target),
colorscale="Viridis",
),
dimensions=plot_list,
)
)
fig.show()

# %% [markdown]
# The `Parcoords` plot is quite similar to the parallel coordinates plot that we
# present in the module on hyperparameter tuning in this MOOC. It displays the
# values of the features on different columns while the target class is color
# coded. Thus, we can quickly inspect whether there is a range of values for a
# certain feature that leads to a particular outcome.
#
# As in the parallel coordinates plot, it is possible to select one or more
# ranges of values by clicking and holding on any axis of the plot. You can then
# slide (move) the range selection and cross two selections to see the
# intersections. You can undo a selection by clicking once again on the same
# axis.
#
# In particular, for this dataset we observe that `"age"` values below 20 years
# are quite predictive of low income, regardless of the values of the other
# features. Similarly, a `"capital-loss"` above `4000` seems to lead to low
# income. We take a quick look at the former observation right after this cell.
#
# Even though it is beyond the scope of the present MOOC, one can additionally
# explore correlations between features, for example using Spearman's rank
# correlation, as the more popular Pearson's correlation is only appropriate for
# continuous data that is normally distributed and linearly related. Spearman's
# correlation is more versatile in dealing with non-linear relationships and
# ordinal data, but it is not meant for nominal categorical data.
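
# %% [markdown]
# Before exploring such correlations, here is a rough, purely illustrative
# sanity check (not part of the original analysis) of the observation on
# `"age"` made above: the class distribution among respondents younger than 20.

# %%
# Proportion of each class among respondents younger than 20
young = adult_census[adult_census["age"] < 20]
young["class"].value_counts(normalize=True)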

# %%
import matplotlib.pyplot as plt
import numpy as np

from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Keep numerical features only
X = adult_census[columns_to_plot].drop(columns=["class", "relationship"])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
corr = spearmanr(X).correlation

# Ensure the correlation matrix is symmetric
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)

# We convert the correlation matrix to a distance matrix before performing
# hierarchical clustering using Ward's linkage.
distance_matrix = 1 - np.abs(corr)
dist_linkage = hierarchy.ward(squareform(distance_matrix))
dendro = hierarchy.dendrogram(
dist_linkage, labels=X.columns.to_list(), ax=ax1, leaf_rotation=90
)
dendro_idx = np.arange(0, len(dendro["ivl"]))

ax2.imshow(corr[dendro["leaves"], :][:, dendro["leaves"]], cmap="coolwarm")
ax2.set_xticks(dendro_idx)
ax2.set_yticks(dendro_idx)
ax2.set_xticklabels(dendro["ivl"], rotation="vertical")
ax2.set_yticklabels(dendro["ivl"])
_ = fig.tight_layout()

# %% [markdown]
# Using a [diverging
# colormap](https://matplotlib.org/stable/users/explain/colors/colormaps.html#diverging)
# such as "coolwarm", the paler the color, the weaker the (anti)correlation
# between features (no correlation is mapped to white). Here, dark blue
# represents a strong negative correlation and dark red a strong positive
# correlation. Indeed, any feature is perfectly correlated with itself, which is
# why the diagonal of the matrix is dark red.
2 changes: 1 addition & 1 deletion python_scripts/linear_models_ex_03.py
@@ -79,7 +79,7 @@
# Write your code here.

# %% [markdown]
-# For the following questions, you can copy adn paste the following snippet to
+# For the following questions, you can copy and paste the following snippet to
# get the feature names from the column transformer here named `preprocessor`.
#
# ```python