ENH Convert some of the Wrap-up M4 content into exercise #731
Merged
+759 −481
Commits (14)
846a9f6 ENH Convert some of the Wrap-up M4 content into exercise
280c4c9 Fix CI
c97f7ec Remove content related to regularization
4a6e132 Exercise M4.04 fixes
480e401 Exercise M4.03 fixes
b704279 Iter
cc62433 Build exercises from solutions
afba33f Add F821 failure to the ignore list in flake8 config
9518a17 Revert add F821 failure to the ignore list in flake8 config
c472672 Use min_frequency to trim number of categorical features
a479fcd Merge branch 'main' of github.com:INRIA/scikit-learn-mooc into move_W…
c4fed31 Solve conflicts
f2e4f01 Revert unnecessary change
28986c2 Better way to avoid flake8 undefined name error
@@ -0,0 +1,170 @@
# ---
# jupyter:
#   jupytext:
#     text_representation:
#       extension: .py
#       format_name: percent
#       format_version: '1.3'
#       jupytext_version: 1.15.2
#   kernelspec:
#     display_name: Python 3
#     name: python3
# ---

# %% [markdown]
# # 📝 Exercise M4.04
#
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, in the slides on 🎥 **Intuitions on regularized linear models**, we
# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model; see the note below.
#
# In this exercise, we ask you to train a logistic regression classifier using
# different values of the parameter `C` to observe its effects by yourself.
#
# We start by loading the dataset. We only keep the Adelie and Chinstrap
# classes to keep the discussion simple.
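
# %% [markdown]
# ```{note}
# As a reminder (our summary of the scikit-learn documentation, not part of
# the original exercise text): with its default `l2` penalty,
# `LogisticRegression` minimizes
#
# $$\min_{w, b} \; \frac{1}{2} \|w\|_2^2
# + C \sum_{i=1}^{n} \log\big(1 + \exp(-y_i (x_i^\top w + b))\big),$$
#
# so the penalty term is fixed and `C` rescales the data-fit term. This is
# the opposite convention to `Ridge`, where `alpha` multiplies the penalty,
# hence `C` behaves like `1 / alpha`.
# ```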

# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```

# %%
import pandas as pd

penguins = pd.read_csv("../datasets/penguins_classification.csv")
penguins = (
    penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index()
)

culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"

# %%
from sklearn.model_selection import train_test_split

penguins_train, penguins_test = train_test_split(
    penguins, random_state=0, test_size=0.4
)

data_train = penguins_train[culmen_columns]
data_test = penguins_test[culmen_columns]

target_train = penguins_train[target_column]
target_test = penguins_test[target_column]

# %% [markdown]
# We define a function to help us fit a given `model` and plot its decision
# boundary. We recall that by using a `DecisionBoundaryDisplay` with a
# diverging colormap, `vmin=0` and `vmax=1`, we ensure that the 0.5
# probability is mapped to the white color. Equivalently, the darker the
# color, the closer the predicted probability is to 0 or 1 and the more
# confident the classifier is in its predictions.

# %%
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.inspection import DecisionBoundaryDisplay


def plot_decision_boundary(model):
    model.fit(data_train, target_train)
    accuracy = model.score(data_test, target_test)
    C = model.get_params()["logisticregression__C"]

    # Filled map of the predicted probability; white marks probability 0.5.
    disp = DecisionBoundaryDisplay.from_estimator(
        model,
        data_train,
        response_method="predict_proba",
        plot_method="pcolormesh",
        cmap="RdBu_r",
        alpha=0.8,
        vmin=0.0,
        vmax=1.0,
    )
    # Dashed contour at probability 0.5, i.e. the decision boundary itself.
    DecisionBoundaryDisplay.from_estimator(
        model,
        data_train,
        response_method="predict_proba",
        plot_method="contour",
        linestyles="--",
        linewidths=1,
        alpha=0.8,
        levels=[0.5],
        ax=disp.ax_,
    )
    # Overlay the training points on top of the probability map.
    sns.scatterplot(
        data=penguins_train,
        x=culmen_columns[0],
        y=culmen_columns[1],
        hue=target_column,
        palette=["tab:blue", "tab:red"],
        ax=disp.ax_,
    )
    plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
    plt.title(f"C: {C} \n Accuracy on the test set: {accuracy:.2f}")

# %% [markdown]
# Let's now create our predictive model.

# %%
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

logistic_regression = make_pipeline(StandardScaler(), LogisticRegression())
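
# %% [markdown]
# As an illustrative aside (not part of the original exercise), we can first
# try the helper on this pipeline with the default value of `C`:

# %%
# LogisticRegression uses C=1.0 by default.
plot_decision_boundary(logistic_regression)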

# %% [markdown]
# ## Influence of the parameter `C` on the decision boundary
#
# Given the following candidates for the `C` parameter and the
# `plot_decision_boundary` function, find out the impact of `C` on the
# classifier's decision boundary.
#
# - How does the value of `C` impact the confidence of the predictions?
# - How does it impact the underfit/overfit trade-off?
# - How does it impact the position and orientation of the decision boundary?
#
# Try to give an interpretation of the reason for such behavior.

# %%
Cs = [1e-6, 0.01, 0.1, 1, 10, 100, 1e6]

# Write your code here.
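
# %% [markdown]
# One possible sketch, not necessarily the official solution: reset the `C`
# parameter of the pipeline before each call, relying on the fact that
# `plot_decision_boundary` refits the model on the training data.

# %%
# Possible sketch: one decision-boundary plot per candidate value of `C`.
# `logisticregression__C` addresses the `C` parameter of the
# `LogisticRegression` step, following the naming used by `make_pipeline`.
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    plot_decision_boundary(logistic_regression)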

# %% [markdown]
# ## Impact of the regularization on the weights
#
# Look at the impact of the `C` hyperparameter on the magnitude of the
# weights. **Hint**: You can [access pipeline
# steps](https://scikit-learn.org/stable/modules/compose.html#access-pipeline-steps)
# by name or position. Then you can query the attributes of that step such as
# `coef_`.

# %%
# Write your code here.
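
# %% [markdown]
# One possible sketch, not necessarily the official solution: refit the
# pipeline for each `C`, access the `LogisticRegression` step by name to read
# its `coef_` attribute, and gather the weights in a dataframe for plotting.

# %%
# Possible sketch: collect the fitted weights for each candidate `C`.
weights = {}
for C in Cs:
    logistic_regression.set_params(logisticregression__C=C)
    logistic_regression.fit(data_train, target_train)
    # For a binary problem, `coef_` has shape (1, n_features); take the
    # single row of weights.
    weights[C] = logistic_regression["logisticregression"].coef_[0]

weights = pd.DataFrame(weights, index=culmen_columns)
weights.plot.barh()
plt.xlabel("Weight value")
_ = plt.title("Logistic regression weights for different values of C")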

# %% [markdown]
# ## Impact of the regularization with non-linear feature engineering
#
# Use the `plot_decision_boundary` function to repeat the experiment using a
# non-linear feature engineering pipeline. For such purpose, insert
# `Nystroem(kernel="rbf", gamma=1, n_components=100)` between the
# `StandardScaler` and the `LogisticRegression` steps.
#
# - Does the value of `C` still impact the position of the decision boundary
#   and the confidence of the model?
# - What can you say about the impact of `C` on the underfitting vs
#   overfitting trade-off?

# %%
from sklearn.kernel_approximation import Nystroem

# Write your code here.
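
# %% [markdown]
# One possible sketch, not necessarily the official solution, reusing the
# candidate values of `C` defined above. Since `make_pipeline` still names
# the classifier step `logisticregression`, the `plot_decision_boundary`
# helper works unchanged.

# %%
# Possible sketch: add a Nystroem kernel approximation between the scaler
# and the classifier, then repeat the experiment for each `C`.
nystroem_regression = make_pipeline(
    StandardScaler(),
    Nystroem(kernel="rbf", gamma=1, n_components=100),
    LogisticRegression(),
)

for C in Cs:
    nystroem_regression.set_params(logisticregression__C=C)
    plot_decision_boundary(nystroem_regression)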
For info: I had to comment these lines out by hand, as they were raising a flake8 error (`F821 undefined name`) when building the exercise from the solution. We may need to think of a better way to avoid this situation in the future.
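A hedged illustration of the situation (the variable name `preprocessor` comes from the discussion below; the actual solution code may differ): the generated exercise references a name that only the solution defines, so flake8 flags it when linting the exercise on its own.

```python
# In the solution notebook, `preprocessor` is defined in an earlier cell. In
# the generated exercise that cell is blank, so a line like this one triggers
# "F821 undefined name 'preprocessor'" unless it is commented out by hand or
# marked with a line-level ignore:
model = make_pipeline(preprocessor, LogisticRegression())  # noqa: F821
```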
Maybe we can add the `F821` failure to the ignore list in the flake8 configuration. Since we run all the code of the notebooks, including the solutions, when building the jupyterbook, we should be safe. The only code we do not check automatically are the solutions to the wrap-up quizzes, but they are in the private repo.

The issue here is that `preprocessor` is defined in the solution but not in the exercise, so I think it will raise an error when building the jupyterbook.

Indeed, we can keep it that way then. Maybe add a comment stating to reuse the `preprocessor` variable defined in the solution of the previous question.