ENH Add Adult census dataset description

ArturoAmorQ · ArturoAmorQ · commit 210133218a69 · 2023-11-02T17:33:15.000+01:00
diff --git a/jupyter-book/_toc.yml b/jupyter-book/_toc.yml
@@ -213,7 +213,7 @@ parts:
   - file: appendix/datasets_intro
     sections:
     - file: python_scripts/trees_dataset
-    - file: appendix/adult_census_description
+    - file: python_scripts/datasets_adult_census
     - file: python_scripts/datasets_california_housing
     - file: python_scripts/datasets_ames_housing
     - file: python_scripts/datasets_blood_transfusion
diff --git a/jupyter-book/appendix/adult_census_description.md b/jupyter-book/appendix/adult_census_description.md
diff --git a/python_scripts/datasets_adult_census.py b/python_scripts/datasets_adult_census.py
@@ -0,0 +1,147 @@
+# ---
+# jupyter:
+#   kernelspec:
+#     display_name: Python 3
+#     name: python3
+# ---
+
+# %% [markdown]
+# # The Adult census dataset
+#
+# [This dataset](http://www.openml.org/d/1590) is a collection of demographic
+# information for the adult population as of 1994 in the USA. The prediction
+# task is to predict whether a person is earning a high or low revenue in
+# USD/year.
+#
+# The column named **class** is the target variable (i.e., the variable which we
+# want to predict). The two possible classes are `" <=50K"` (low-revenue) and
+# `" >50K"` (high-revenue).
+#
+# Before drawing any conclusions based on its statistics or the predictions of
+# models trained on it, remember that this dataset is not only outdated, but is
+# also not representative of the US population. In fact, the original data
+# contains a feature named `fnlwgt` that encodes the number of units in the
+# target population that the responding unit represents.
+#
+# First we load the dataset. We keep only some columns of interest to ease the
+# plotting.
+
+# %%
+import pandas as pd
+
+adult_census = pd.read_csv("../datasets/adult-census.csv")
+columns_to_plot = [
+    "age",
+    "education-num",
+    "capital-loss",
+    "capital-gain",
+    "hours-per-week",
+    "relationship",
+    "class",
+]
+target_name = "class"
+target = adult_census[target_name]
+
+# %% [markdown]
+# We explore this dataset in the first module's notebook "First look at our
+# dataset", where we provide a first intuition on how the data is structured.
+# There, we use a seaborn pairplot to visualize pairwise relationships between
+# the numerical variables in the dataset. This tool aligns scatter plots for every pair
+# of variables and histograms for the plots in the
+# diagonal of the array.
+#
+# This approach is limited:
+# - Pair plots can only deal with numerical features and;
+# - by observing pairwise interactions we end up with a two-dimensional
+#   projection of a multi-dimensional feature space, which can lead to a wrong
+#   interpretation of the individual impact of a feature.
+#
+# Here we explore with some more detail the relation between features using
+# plotly `Parcoords`.
+
+# %%
+import plotly.graph_objects as go
+from sklearn.preprocessing import LabelEncoder
+
+le = LabelEncoder()
+
+
+def generate_dict(col):
+    """Check if column is categorical and generate the appropriate dict"""
+    if adult_census[col].dtype == "object":  # Categorical column
+        encoded = le.fit_transform(adult_census[col])
+        return {
+            "tickvals": list(range(len(le.classes_))),
+            "ticktext": list(le.classes_),
+            "label": col,
+            "values": encoded,
+        }
+    else:  # Numerical column
+        return {"label": col, "values": adult_census[col]}
+
+
+plot_list = [generate_dict(col) for col in columns_to_plot]
+
+fig = go.Figure(
+    data=go.Parcoords(
+        line=dict(
+            color=le.fit_transform(target),
+            colorscale="Viridis",
+        ),
+        dimensions=plot_list,
+    )
+)
+fig.show()
+
+# %% [markdown]
+# The `Parcoords` plot is quite similar to the parallel coordinates plot that we
+# present in the module on hyperparameters tuning in this mooc. It display the
+# values of the features on different columns while the target class is color
+# coded. Thus, we are able to quickly inspect if there is a range of values for
+# a certain feature which is leading to a particular result.
+#
+# As in the parallel coordinates plot, it is possible to select one or more
+# ranges of values by clicking and holding on any axis of the plot. You can then
+# slide (move) the range selection and cross two selections to see the
+# intersections. You can undo a selection by clicking once again on the same
+# axis.
+#
+# In particular for this dataset we observe that values of `"age"` lower to 20
+# years are quite predictive of low-income, regardless of the value of other
+# features. Similarly, a `"capital-loss"` above `4000` seems to lead to
+# low-income.
+#
+# In this case we can additionaly observe that the variables `"age"` and
+# `"relationship"` are more correlated than the others:
+
+# %%
+import matplotlib.pyplot as plt
+import numpy as np
+
+from scipy.cluster import hierarchy
+from scipy.spatial.distance import squareform
+from scipy.stats import spearmanr
+
+X = adult_census[columns_to_plot].drop(columns="class")
+fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 8))
+corr = spearmanr(X).correlation
+
+# Ensure the correlation matrix is symmetric
+corr = (corr + corr.T) / 2
+np.fill_diagonal(corr, 1)
+
+# We convert the correlation matrix to a distance matrix before performing
+# hierarchical clustering using Ward's linkage.
+distance_matrix = 1 - np.abs(corr)
+dist_linkage = hierarchy.ward(squareform(distance_matrix))
+dendro = hierarchy.dendrogram(
+    dist_linkage, labels=X.columns.to_list(), ax=ax1, leaf_rotation=90
+)
+dendro_idx = np.arange(0, len(dendro["ivl"]))
+
+ax2.imshow(corr[dendro["leaves"], :][:, dendro["leaves"]])
+ax2.set_xticks(dendro_idx)
+ax2.set_yticks(dendro_idx)
+ax2.set_xticklabels(dendro["ivl"], rotation="vertical")
+ax2.set_yticklabels(dendro["ivl"])
+_ = fig.tight_layout()