
Commit adf38a8

[ci skip] ENH Convert some of the Wrap-up M4 content into exercise (#731) 008cff4
1 parent e0570d1 commit adf38a8

File tree

205 files changed: +3560 -1627 lines


_sources/python_scripts/linear_models_ex_03.py

+88 -39
@@ -14,69 +14,118 @@
 # %% [markdown]
 # # 📝 Exercise M4.03
 #
-# The parameter `penalty` can control the **type** of regularization to use,
-# whereas the regularization **strength** is set using the parameter `C`.
-# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In
-# this exercise, we ask you to train a logistic regression classifier using the
-# `penalty="l2"` regularization (which happens to be the default in
-# scikit-learn) to find by yourself the effect of the parameter `C`.
-#
-# We start by loading the dataset.
+# Now, we tackle a more realistic classification problem instead of making a
+# synthetic dataset. We start by loading the Adult Census dataset with the
+# following snippet. For the moment we retain only the **numerical features**.
+
+# %%
+import pandas as pd
+
+adult_census = pd.read_csv("../datasets/adult-census.csv")
+target = adult_census["class"]
+data = adult_census.select_dtypes(["integer", "floating"])
+data = data.drop(columns=["education-num"])
+data
 
 # %% [markdown]
-# ```{note}
-# If you want a deeper overview regarding this dataset, you can refer to the
-# Appendix - Datasets description section at the end of this MOOC.
-# ```
+# We confirm that all the selected features are numerical.
+#
+# Compute the generalization performance in terms of accuracy of a linear model
+# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
+# cross-validation with `return_estimator=True` to be able to inspect the
+# trained estimators.
 
 # %%
-import pandas as pd
+# Write your code here.
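An illustrative sketch of how this step could be solved (not part of the diff; `cv_results_num` is a name introduced here, and the code reuses the `data` and `target` variables from the added snippet above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 10-fold cross-validation; keep the fitted pipelines so they can be inspected later
model_num = make_pipeline(StandardScaler(), LogisticRegression())
cv_results_num = cross_validate(
    model_num, data, target, cv=10, return_estimator=True
)
print(f"Mean accuracy: {cv_results_num['test_score'].mean():.3f}")
```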

-penguins = pd.read_csv("../datasets/penguins_classification.csv")
-# only keep the Adelie and Chinstrap classes
-penguins = (
-    penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
-)
+# %% [markdown]
+# What is the most important feature seen by the logistic regression?
+#
+# You can use a boxplot to compare the absolute values of the coefficients while
+# also visualizing the variability induced by the cross-validation resampling.
+
+# %%
+# Write your code here.
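A possible sketch for the coefficient inspection (it assumes the illustrative `cv_results_num` from the previous sketch):

```python
import matplotlib.pyplot as plt
import pandas as pd

# One row of coefficients per cross-validation fold; the target is binary,
# so coef_ has a single row per fitted LogisticRegression
coefs = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_num["estimator"]],
    columns=data.columns,
)
coefs.abs().plot.box(vert=False)
plt.xlabel("Absolute value of the coefficient")
plt.show()
```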

-culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
-target_column = "Species"
+# %% [markdown]
+# Let's now work with **both numerical and categorical features**. You can
+# reload the Adult Census dataset with the following snippet:
 
 # %%
-from sklearn.model_selection import train_test_split
+adult_census = pd.read_csv("../datasets/adult-census.csv")
+target = adult_census["class"]
+data = adult_census.drop(columns=["class", "education-num"])
+
+# %% [markdown]
+# Create a predictive model where:
+# - The numerical data must be scaled.
+# - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
+#   group categories concerning less than 1% of the total samples.
+# - The predictor is a `LogisticRegression`. You may need to increase the number
+#   of `max_iter`, which is 100 by default.
+#
+# Use the same 10-fold cross-validation strategy with `return_estimator=True` as
+# above to evaluate this complex pipeline.
 
-penguins_train, penguins_test = train_test_split(penguins, random_state=0)
+# %%
+# Write your code here.
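One plausible assembly of such a pipeline (a sketch: the `make_column_selector` split, `handle_unknown="ignore"` and `max_iter=1000` are illustrative choices, not requirements stated in the exercise):

```python
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_columns = make_column_selector(dtype_exclude=object)(data)
categorical_columns = make_column_selector(dtype_include=object)(data)

# Scale numerical columns; one-hot encode categorical ones, grouping rare categories
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", min_frequency=0.01), categorical_columns),
    (StandardScaler(), numerical_columns),
)
model_all = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
cv_results_all = cross_validate(
    model_all, data, target, cv=10, return_estimator=True
)
print(f"Mean accuracy: {cv_results_all['test_score'].mean():.3f}")
```

Putting the one-hot encoder first means the transformed columns come out in the order assumed by the feature-name snippet further down (encoded categorical features followed by numerical ones).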

-data_train = penguins_train[culmen_columns]
-data_test = penguins_test[culmen_columns]
+# %% [markdown]
+# By comparing the cross-validation test scores of both models fold-to-fold,
+# count the number of times the model using both numerical and categorical
+# features has a better test score than the model using only numerical features.
 
-target_train = penguins_train[target_column]
-target_test = penguins_test[target_column]
+# %%
+# Write your code here.
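The fold-to-fold comparison can then be a short NumPy expression (sketch, reusing the illustrative `cv_results_num` and `cv_results_all` names):

```python
import numpy as np

# Count the folds where the numerical + categorical model scores higher
n_better = np.sum(cv_results_all["test_score"] > cv_results_num["test_score"])
print(f"{n_better} out of {len(cv_results_all['test_score'])} folds")
```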

 # %% [markdown]
-# First, let's create our predictive model.
+# For the following questions, you can copy and paste the following snippet to
+# get the feature names from the column transformer here named `preprocessor`.
+#
+# ```python
+# preprocessor.fit(data)
+# feature_names = (
+#     preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
+#         categorical_columns
+#     )
+# ).tolist()
+# feature_names += numerical_columns
+# feature_names
+# ```
 
 # %%
-from sklearn.pipeline import make_pipeline
-from sklearn.preprocessing import StandardScaler
-from sklearn.linear_model import LogisticRegression
+# Write your code here.
 
-logistic_regression = make_pipeline(
-    StandardScaler(), LogisticRegression(penalty="l2")
-)
+# %% [markdown]
+# Notice that there are as many feature names as coefficients in the last step
+# of your predictive pipeline.
 
 # %% [markdown]
-# Given the following candidates for the `C` parameter, find out the impact of
-# `C` on the classifier decision boundary. You can use
-# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
-# decision function boundary.
+# Which of the following pairs of features is most impacting the predictions of
+# the logistic regression classifier based on the absolute magnitude of its
+# coefficients?
 
 # %%
-Cs = [0.01, 0.1, 1, 10]
+# Write your code here.
+
+# %% [markdown]
+# Now create a similar pipeline consisting of the same preprocessor as above,
+# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
+# Set `degree=2` and `interaction_only=True` to the feature engineering step.
+# Remember not to include a "bias" feature to avoid introducing a redundancy
+# with the intercept of the subsequent logistic regression.
 
+# %%
 # Write your code here.
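A sketch of that interaction pipeline (illustrative; it reuses the hypothetical `preprocessor` from the earlier sketch):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Interaction terms only, and no bias column since LogisticRegression fits an intercept
model_interactions = make_pipeline(
    preprocessor,
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(C=0.01, max_iter=1000),
)
cv_results_interactions = cross_validate(
    model_interactions, data, target, cv=10, return_estimator=True
)
print(f"Mean accuracy: {cv_results_interactions['test_score'].mean():.3f}")
```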

 # %% [markdown]
-# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
+# By comparing the cross-validation test scores of both models fold-to-fold,
+# count the number of times the model using multiplicative interactions and both
+# numerical and categorical features has a better test score than the model
+# without interactions.
+
+# %%
+# Write your code here.
 
 # %%
 # Write your code here.

@@ -0,0 +1,170 @@
+# ---
+# jupyter:
+#   jupytext:
+#     text_representation:
+#       extension: .py
+#       format_name: percent
+#       format_version: '1.3'
+#     jupytext_version: 1.15.2
+#   kernelspec:
+#     display_name: Python 3
+#     name: python3
+# ---
+
+# %% [markdown]
+# # 📝 Exercise M4.04
+#
+# In the previous Module we tuned the hyperparameter `C` of the logistic
+# regression without mentioning that it controls the regularization strength.
+# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
+# mentioned that a small `C` provides a more regularized model, whereas a
+# non-regularized model is obtained with an infinitely large value of `C`.
+# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
+# model.
+#
+# In this exercise, we ask you to train a logistic regression classifier using
+# different values of the parameter `C` to find its effects by yourself.
+#
+# We start by loading the dataset. We only keep the Adelie and Chinstrap classes
+# to keep the discussion simple.
+
+
+# %% [markdown]
+# ```{note}
+# If you want a deeper overview regarding this dataset, you can refer to the
+# Appendix - Datasets description section at the end of this MOOC.
+# ```
+
+# %%
+import pandas as pd
+
+penguins = pd.read_csv("../datasets/penguins_classification.csv")
+penguins = (
+    penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
+)
+
+culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
+target_column = "Species"
+
+# %%
+from sklearn.model_selection import train_test_split
+
+penguins_train, penguins_test = train_test_split(
+    penguins, random_state=0, test_size=0.4
+)
+
+data_train = penguins_train[culmen_columns]
+data_test = penguins_test[culmen_columns]
+
+target_train = penguins_train[target_column]
+target_test = penguins_test[target_column]
+
+# %% [markdown]
+# We define a function to help us fit a given `model` and plot its decision
+# boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging
+# colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped
+# to the white color. Equivalently, the darker the color, the closer the
+# predicted probability is to 0 or 1 and the more confident the classifier is in
+# its predictions.
+
+# %%
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.inspection import DecisionBoundaryDisplay
+
+
+def plot_decision_boundary(model):
+    model.fit(data_train, target_train)
+    accuracy = model.score(data_test, target_test)
+    C = model.get_params()["logisticregression__C"]
+
+    disp = DecisionBoundaryDisplay.from_estimator(
+        model,
+        data_train,
+        response_method="predict_proba",
+        plot_method="pcolormesh",
+        cmap="RdBu_r",
+        alpha=0.8,
+        vmin=0.0,
+        vmax=1.0,
+    )
+    DecisionBoundaryDisplay.from_estimator(
+        model,
+        data_train,
+        response_method="predict_proba",
+        plot_method="contour",
+        linestyles="--",
+        linewidths=1,
+        alpha=0.8,
+        levels=[0.5],
+        ax=disp.ax_,
+    )
+    sns.scatterplot(
+        data=penguins_train,
+        x=culmen_columns[0],
+        y=culmen_columns[1],
+        hue=target_column,
+        palette=["tab:blue", "tab:red"],
+        ax=disp.ax_,
+    )
+    plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
+    plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")
+
+
+# %% [markdown]
+# Let's now create our predictive model.
+
+# %%
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import StandardScaler
+from sklearn.linear_model import LogisticRegression
+
+logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
+
+# %% [markdown]
+# ## Influence of the parameter `C` on the decision boundary
+#
+# Given the following candidates for the `C` parameter and the
+# `plot_decision_boundary` function, find out the impact of `C` on the
+# classifier's decision boundary.
+#
+# - How does the value of `C` impact the confidence on the predictions?
+# - How does it impact the underfit/overfit trade-off?
+# - How does it impact the position and orientation of the decision boundary?
+#
+# Try to give an interpretation on the reason for such behavior.
+
+# %%
+Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]
+
+# Write your code here.
+
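The experiment itself can be a short loop (a minimal sketch using the `Cs` list and the `plot_decision_boundary` helper defined in the file above):

```python
# Refit and plot the decision boundary for each candidate regularization strength
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    plot_decision_boundary(logistic_regression)
```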
+# %% [markdown]
+# ## Impact of the regularization on the weights
+#
+# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
+# **Hint**: You can [access pipeline
+# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)
+# by name or position. Then you can query the attributes of that step such as
+# `coef_`.
+
+# %%
+# Write your code here.
+
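One way to look at the weights (a sketch; `weights_by_C` is a name introduced here for illustration):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Collect the coefficients of the last pipeline step (the LogisticRegression) for each C
weights_by_C = {}
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    weights_by_C[C] = logistic_regression[-1].coef_[0]

weights = pd.DataFrame(weights_by_C, index=culmen_columns).T
weights.plot.bar()
plt.xlabel("C")
plt.ylabel("Coefficient value")
plt.show()
```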
+# %% [markdown]
+# ## Impact of the regularization with non-linear feature engineering
+#
+# Use the `plot_decision_boundary` function to repeat the experiment using a
+# non-linear feature engineering pipeline. For such purpose, insert
+# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the
+# `StandardScaler` and the `LogisticRegression` steps.
+#
+# - Does the value of `C` still impact the position of the decision boundary and
+#   the confidence of the model?
+# - What can you say about the impact of `C` on the underfitting vs overfitting
+#   trade-off?
+
+# %%
+from sklearn.kernel_approximation import Nystroem
+
+# Write your code here.
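And a sketch for the non-linear variant (illustrative; the helper still retrieves `C` because `make_pipeline` keeps naming the classifier step `logisticregression`):

```python
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same experiment with an approximate RBF kernel expansion before the linear classifier
for C in Cs:
    nystroem_regression = make_pipeline(
        StandardScaler(),
        Nystroem(kernel="rbf", gamma=1, n_components=100),
        LogisticRegression(C=C),
    )
    plot_decision_boundary(nystroem_regression)
```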
