ENH Convert some of the Wrap-up M4 content into exercise #731

Merged
merged 14 commits into from
Oct 27, 2023
7 changes: 4 additions & 3 deletions jupyter-book/_toc.yml
Original file line number Diff line number Diff line change
@@ -102,14 +102,15 @@ parts:
- file: python_scripts/linear_models_ex_02
- file: python_scripts/linear_models_sol_02
- file: python_scripts/linear_models_feature_engineering_classification.py
- file: python_scripts/logistic_regression_non_linear
- file: python_scripts/linear_models_ex_03
- file: python_scripts/linear_models_sol_03
- file: linear_models/linear_models_quiz_m4_02
- file: linear_models/linear_models_regularization_index
sections:
- file: linear_models/regularized_linear_models_slides
- file: python_scripts/linear_models_regularization
- file: python_scripts/linear_models_ex_03
- file: python_scripts/linear_models_sol_03
- file: python_scripts/linear_models_ex_04
- file: python_scripts/linear_models_sol_04
- file: linear_models/linear_models_quiz_m4_03
- file: linear_models/linear_models_wrap_up_quiz
- file: linear_models/linear_models_module_take_away
127 changes: 88 additions & 39 deletions python_scripts/linear_models_ex_03.py
@@ -14,69 +14,118 @@
# %% [markdown]
# # 📝 Exercise M4.03
#
# The parameter `penalty` can control the **type** of regularization to use,
# whereas the regularization **strength** is set using the parameter `C`.
# Setting `penalty="none"` is equivalent to an infinitely large value of `C`. In
# this exercise, we ask you to train a logistic regression classifier using the
# `penalty="l2"` regularization (which happens to be the default in
# scikit-learn) to find by yourself the effect of the parameter `C`.
#
# We start by loading the dataset.
# Now, we tackle a more realistic classification problem instead of making a
# synthetic dataset. We start by loading the Adult Census dataset with the
# following snippet. For the moment we retain only the **numerical features**.

# %%
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.select_dtypes(["integer", "floating"])
data = data.drop(columns=["education-num"])
data

# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```
# We confirm that all the selected features are numerical.
#
# Compute the generalization performance in terms of accuracy of a linear model
# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold
# cross-validation with `return_estimator=True` to be able to inspect the
# trained estimators.

# %%
import pandas as pd
# Write your code here.
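
As a rough illustration of the evaluation described above, the following sketch cross-validates a scaled logistic regression with `return_estimator=True`. It assumes a synthetic dataset from `make_classification` as a stand-in for the numerical census features, so the scores it prints say nothing about the Adult Census data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data standing in for the numerical census features.
X, y = make_classification(
    n_samples=500,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    random_state=0,
)

model = make_pipeline(StandardScaler(), LogisticRegression())

# return_estimator=True keeps the 10 fitted pipelines for later inspection.
cv_results = cross_validate(model, X, y, cv=10, return_estimator=True)

test_scores = cv_results["test_score"]
print(f"Accuracy: {test_scores.mean():.3f} ± {test_scores.std():.3f}")
```

The fitted pipelines are then available in `cv_results["estimator"]`, one per fold.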

penguins = pd.read_csv("../datasets/penguins_classification.csv")
# only keep the Adelie and Chinstrap classes
penguins = (
penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)
# %% [markdown]
# What is the most important feature seen by the logistic regression?
#
# You can use a boxplot to compare the absolute values of the coefficients while
# also visualizing the variability induced by the cross-validation resampling.

# %%
# Write your code here.
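
One possible way to gather the coefficients of each fold is sketched below. It again assumes synthetic data, and the names `feature_0`, `feature_1`, ... are hypothetical placeholders rather than the census column names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data with hypothetical feature names.
X, y = make_classification(
    n_samples=500,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_results = cross_validate(model, X, y, cv=10, return_estimator=True)

# One row per fold, one column per feature. `est[-1]` is the last pipeline
# step, i.e. the fitted LogisticRegression.
coefs = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results["estimator"]],
    columns=feature_names,
)
abs_coefs = coefs.abs()

# Median absolute coefficient per feature, largest first.
print(abs_coefs.median().sort_values(ascending=False))
```

With matplotlib installed, `abs_coefs.plot.box()` draws one box per feature, so the spread of each box reflects the variability across folds.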

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"
# %% [markdown]
# Let's now work with **both numerical and categorical features**. You can
# reload the Adult Census dataset with the following snippet:

# %%
from sklearn.model_selection import train_test_split
adult_census = pd.read_csv("../datasets/adult-census.csv")
target = adult_census["class"]
data = adult_census.drop(columns=["class", "education-num"])

# %% [markdown]
# Create a predictive model where:
# - The numerical data must be scaled.
# - The categorical data must be one-hot encoded, set `min_frequency=0.01` to
# group categories concerning less than 1% of the total samples.
# - The predictor is a `LogisticRegression`. You may need to increase the number
# of `max_iter`, which is 100 by default.
#
# Use the same 10-fold cross-validation strategy with `return_estimator=True` as
# above to evaluate this complex pipeline.

penguins_train, penguins_test = train_test_split(penguins, random_state=0)
# %%
# Write your code here.

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]
# %% [markdown]
# By comparing the cross-validation test scores of both models fold-to-fold,
# count the number of times the model using both numerical and categorical
# features has a better test score than the model using only numerical features.

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]
# %%
# Write your code here.
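
The fold-to-fold comparison boils down to an element-wise comparison of two score arrays. The sketch below uses made-up placeholder scores (they are not results from any model) purely to show the mechanics.

```python
import numpy as np

# Hypothetical per-fold test scores: placeholders, not measured results.
scores_numerical_only = np.array([0.80, 0.79, 0.81, 0.80, 0.78])
scores_all_features = np.array([0.85, 0.79, 0.84, 0.79, 0.86])

# Element-wise comparison yields one boolean per fold; summing counts the wins.
n_better = int((scores_all_features > scores_numerical_only).sum())
print(f"Better on {n_better} of {len(scores_all_features)} folds")
```

Such a comparison only makes sense when both models are evaluated on the same splits, which is the case when both `cross_validate` calls use the same integer `cv` on the same data and target.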

# %% [markdown]
# First, let's create our predictive model.
# For the following questions, you can copy and paste the following snippet to
# get the feature names from the column transformer here named `preprocessor`.
#
# ```python
# preprocessor.fit(data)
# feature_names = (
#     preprocessor.named_transformers_["onehotencoder"].get_feature_names_out(
#         categorical_columns
#     )
# ).tolist()
# feature_names += numerical_columns
# feature_names
Comment on lines +86 to +93
Collaborator Author

@ArturoAmorQ ArturoAmorQ Oct 17, 2023

For info: I had to comment out these lines by hand, as they were raising a flake8 F821 (undefined name) error when building the exercise from the solution.

We may need to think of a better way to avoid this situation in the future.

Collaborator

Maybe we can add the F821 failure to the ignore list in the flake8 configuration. Since we run all the code of the notebooks, including the solutions when building the jupyterbook we should be safe. The only code we do not check automatically are the solutions to the wrap up quiz but they are in the private repo.

Collaborator Author

The issue here is that preprocessor is defined in the solution but not in the exercise. So I think it will raise an error when building the jupyterbook.

Collaborator

Indeed. We can keep it that way then.

Maybe add a comment telling readers to reuse the `preprocessor` variable defined in the solution to the previous question.

# ```

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Write your code here.

logistic_regression = make_pipeline(
StandardScaler(), LogisticRegression(penalty="l2")
)
# %% [markdown]
# Notice that there are as many feature names as coefficients in the last step
# of your predictive pipeline.

# %% [markdown]
# Given the following candidates for the `C` parameter, find out the impact of
# `C` on the classifier decision boundary. You can use
# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the
# decision function boundary.
# Which of the following pairs of features is most impacting the predictions of
# the logistic regression classifier based on the absolute magnitude of its
# coefficients?

# %%
Cs = [0.01, 0.1, 1, 10]
# Write your code here.

# %% [markdown]
# Now create a similar pipeline consisting of the same preprocessor as above,
# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.
# Set `degree=2` and `interaction_only=True` to the feature engineering step.
# Remember not to include a "bias" feature to avoid introducing a redundancy
# with the intercept of the subsequent logistic regression.

# %%
# Write your code here.

# %% [markdown]
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# By comparing the cross-validation test scores of both models fold-to-fold,
# count the number of times the model using multiplicative interactions and both
# numerical and categorical features has a better test score than the model
# without interactions.

# %%
# Write your code here.

# %%
# Write your code here.
170 changes: 170 additions & 0 deletions python_scripts/linear_models_ex_04.py
@@ -0,0 +1,170 @@
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.15.2
#   kernelspec:
#     display_name: Python 3
#     name: python3
# ---

# %% [markdown]
# # 📝 Exercise M4.04
#
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, in the slides on 🎥 **Intuitions on regularized linear models**, we
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
#
# In this exercise, we ask you to train a logistic regression classifier using
# different values of the parameter `C` to find its effects by yourself.
#
# We start by loading the dataset. We only keep the Adelie and Chinstrap classes
# to keep the discussion simple.


# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```

# %%
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")
penguins = (
    penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# %%
from sklearn.model_selection import train_test_split

penguins_train, penguins_test = train_test_split(
    penguins, random_state=0, test_size=0.4
)

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]

# %% [markdown]
# We define a function to help us fit a given `model` and plot its decision
# boundary. We recall that by using a `DecisionBoundaryDisplay` with diverging
# colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5 probability is mapped
# to the white color. Equivalently, the darker the color, the closer the
# predicted probability is to 0 or 1 and the more confident the classifier is in
# its predictions.

# %%
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay


def plot_decision_boundary(model):
    model.fit(data_train, target_train)
    accuracy = model.score(data_test, target_test)
    C = model.get_params()["logisticregression__C"]

    disp = DecisionBoundaryDisplay.from_estimator(
        model,
        data_train,
        response_method="predict_proba",
        plot_method="pcolormesh",
        cmap="RdBu_r",
        alpha=0.8,
        vmin=0.0,
        vmax=1.0,
    )
    DecisionBoundaryDisplay.from_estimator(
        model,
        data_train,
        response_method="predict_proba",
        plot_method="contour",
        linestyles="--",
        linewidths=1,
        alpha=0.8,
        levels=[0.5],
        ax=disp.ax_,
    )
    sns.scatterplot(
        data=penguins_train,
        x=culmen_columns[0],
        y=culmen_columns[1],
        hue=target_column,
        palette=["tab:blue", "tab:red"],
        ax=disp.ax_,
    )
    plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
    plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")


# %% [markdown]
# Let's now create our predictive model.

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())

# %% [markdown]
# ## Influence of the parameter `C` on the decision boundary
#
# Given the following candidates for the `C` parameter and the
# `plot_decision_boundary` function, find out the impact of `C` on the
# classifier's decision boundary.
#
# - How does the value of `C` impact the confidence in the predictions?
# - How does it impact the underfit/overfit trade-off?
# - How does it impact the position and orientation of the decision boundary?
#
# Try to explain the reason for such behavior.

# %%
Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]

# Write your code here.

# %% [markdown]
# ## Impact of the regularization on the weights
#
# Look at the impact of the `C` hyperparameter on the magnitude of the weights.
# **Hint**: You can [access pipeline
# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)
# by name or position. Then you can query the attributes of that step such as
# `coef_`.

# %%
# Write your code here.
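
The hint above can be sketched as follows: the value of `C` is changed with `set_params`, and the fitted classifier is reached by position with `model[-1]`. The data is a synthetic stand-in (an assumption), and the three values of `C` are chosen arbitrarily for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic two-feature data with some label noise.
X, y = make_classification(
    n_samples=300,
    n_features=2,
    n_informative=2,
    n_redundant=0,
    flip_y=0.1,
    random_state=0,
)

model = make_pipeline(StandardScaler(), LogisticRegression())

weight_norms = {}
for C in [0.01, 1, 100]:
    model.set_params(logisticregression__C=C)
    model.fit(X, y)
    # model[-1] is the last step; model.named_steps["logisticregression"]
    # would return the same object.
    coef = model[-1].coef_[0]
    weight_norms[C] = float(np.linalg.norm(coef))
    print(f"C={C}: weights={coef}, norm={weight_norms[C]:.3f}")
```

Since a small `C` means stronger regularization, the weights are expected to shrink toward zero as `C` decreases.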

# %% [markdown]
# ## Impact of the regularization with non-linear feature engineering
#
# Use the `plot_decision_boundary` function to repeat the experiment using a
# non-linear feature engineering pipeline. For such purpose, insert
# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the
# `StandardScaler` and the `LogisticRegression` steps.
#
# - Does the value of `C` still impact the position of the decision boundary and
# the confidence of the model?
# - What can you say about the impact of `C` on the underfitting vs overfitting
# trade-off?

# %%
from sklearn.kernel_approximation import Nystroem

# Write your code here.
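
A minimal sketch of the pipeline described above is shown here. It assumes a synthetic two-moons dataset instead of the penguins data, and it fixes `random_state=0` in `Nystroem` for reproducibility, which the exercise text does not require.

```python
from sklearn.datasets import make_moons
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic non-linearly separable data.
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

model = make_pipeline(
    StandardScaler(),
    # random_state is an addition for reproducibility (assumption).
    Nystroem(kernel="rbf", gamma=1, n_components=100, random_state=0),
    LogisticRegression(C=1),
)
model.fit(X, y)

# Slicing the pipeline applies every step except the final classifier,
# exposing the approximate kernel feature space seen by the linear model.
X_mapped = model[:-1].transform(X)
print(X_mapped.shape)
```

The step names generated by `make_pipeline` are unchanged, so `logisticregression__C` can still be set with `set_params` and the `plot_decision_boundary` helper keeps working with such a pipeline.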