14 | 14 | # %% [markdown]
15 | 15 | # # 📝 Exercise M4.03
16 | 16 | #
17 | | -# The parameter `penalty` can control the **type** of regularization to use, |
18 | | -# whereas the regularization **strength** is set using the parameter `C`. |
19 | | -# Setting`penalty="none"` is equivalent to an infinitely large value of `C`. In |
20 | | -# this exercise, we ask you to train a logistic regression classifier using the |
21 | | -# `penalty="l2"` regularization (which happens to be the default in |
22 | | -# scikit-learn) to find by yourself the effect of the parameter `C`. |
23 | | -# |
24 | | -# We start by loading the dataset. |
| 17 | +# Now, we tackle a more realistic classification problem instead of making a |
| 18 | +# synthetic dataset. We start by loading the Adult Census dataset with the |
| 19 | +# following snippet. For the moment we retain only the **numerical features**. |
| 20 | + |
| 21 | +# %% |
| 22 | +import pandas as pd |
| 23 | + |
| 24 | +adult_census = pd.read_csv("../datasets/adult-census.csv") |
| 25 | +target = adult_census["class"] |
| 26 | +data = adult_census.select_dtypes(["integer", "floating"]) |
| 27 | +data = data.drop(columns=["education-num"]) |
| 28 | +data |
25 | 29 |
26 | 30 | # %% [markdown]
27 | | -# ```{note} |
28 | | -# If you want a deeper overview regarding this dataset, you can refer to the |
29 | | -# Appendix - Datasets description section at the end of this MOOC. |
30 | | -# ``` |
| 31 | +# We confirm that all the selected features are numerical. |
| 32 | +# |
| 33 | +# Compute the generalization performance in terms of accuracy of a linear model |
| 34 | +# composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold |
| 35 | +# cross-validation with `return_estimator=True` to be able to inspect the |
| 36 | +# trained estimators. |
31 | 37 |
32 | 38 | # %%
33 | | -import pandas as pd |
| 39 | +# Write your code here. |
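A minimal sketch of one way to fill in this cell, reusing the `data` and `target` variables loaded above; the names `model` and `cv_results_num` are illustrative and not part of the original exercise:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Scale the numerical features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())

# 10-fold cross-validation; return_estimator=True keeps each fitted pipeline
# so that its coefficients can be inspected afterwards.
cv_results_num = cross_validate(model, data, target, cv=10, return_estimator=True)
test_scores = cv_results_num["test_score"]
print(f"Accuracy: {test_scores.mean():.3f} ± {test_scores.std():.3f}")
```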
34 | 40 |
35 | | -penguins = pd.read_csv("../datasets/penguins_classification.csv") |
36 | | -# only keep the Adelie and Chinstrap classes |
37 | | -penguins = ( |
38 | | -    penguins.set_index("Species").loc[["Adelie", "Chinstrap"]].reset_index() |
39 | | -) |
| 41 | +# %% [markdown] |
| 42 | +# What is the most important feature seen by the logistic regression? |
| 43 | +# |
| 44 | +# You can use a boxplot to compare the absolute values of the coefficients while |
| 45 | +# also visualizing the variability induced by the cross-validation resampling. |
| 46 | + |
| 47 | +# %% |
| 48 | +# Write your code here. |
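One possible sketch for the coefficient boxplot, assuming the cross-validation results from the previous step are stored in a variable such as `cv_results_num` (an illustrative name):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Extract the coefficients of the final LogisticRegression step in each fold.
coefs = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_num["estimator"]],
    columns=data.columns,
)

# Boxplot of absolute values: the feature with the largest magnitude is the
# one the linear model relies on the most.
coefs.abs().plot.box(vert=False)
plt.xlabel("Absolute value of the coefficient")
plt.show()
```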
40 | 49 |
41 | | -culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"] |
42 | | -target_column = "Species" |
| 50 | +# %% [markdown] |
| 51 | +# Let's now work with **both numerical and categorical features**. You can |
| 52 | +# reload the Adult Census dataset with the following snippet: |
43 | 53 |
44 | 54 | # %%
45 | | -from sklearn.model_selection import train_test_split |
| 55 | +adult_census = pd.read_csv("../datasets/adult-census.csv") |
| 56 | +target = adult_census["class"] |
| 57 | +data = adult_census.drop(columns=["class", "education-num"]) |
| 58 | + |
| 59 | +# %% [markdown] |
| 60 | +# Create a predictive model where: |
| 61 | +# - The numerical data must be scaled. |
| 62 | +# - The categorical data must be one-hot encoded, set `min_frequency=0.01` to |
| 63 | +# group categories concerning less than 1% of the total samples. |
| 64 | +# - The predictor is a `LogisticRegression`. You may need to increase the number |
| 65 | +# of `max_iter`, which is 100 by default. |
| 66 | +# |
| 67 | +# Use the same 10-fold cross-validation strategy with `return_estimator=True` as |
| 68 | +# above to evaluate this complex pipeline. |
46 | 69 |
47 | | -penguins_train, penguins_test = train_test_split(penguins, random_state=0) |
| 70 | +# %% |
| 71 | +# Write your code here. |
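A sketch of one possible pipeline, selecting columns by dtype with `make_column_selector`; the names `preprocessor`, `model_all` and `cv_results_all` are illustrative, and `max_iter=5_000` is just an arbitrary large value:

```python
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numerical_columns = make_column_selector(dtype_include="number")(data)
categorical_columns = make_column_selector(dtype_include=object)(data)

# One-hot encode the categorical columns (grouping categories present in less
# than 1% of the samples) and scale the numerical ones. The categorical
# transformer is listed first so the column order matches the feature-name
# snippet used later in this exercise.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore", min_frequency=0.01), categorical_columns),
    (StandardScaler(), numerical_columns),
)
model_all = make_pipeline(preprocessor, LogisticRegression(max_iter=5_000))

cv_results_all = cross_validate(model_all, data, target, cv=10, return_estimator=True)
print(f"Accuracy: {cv_results_all['test_score'].mean():.3f}")
```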
48 | 72 |
49 | | -data_train = penguins_train[culmen_columns] |
50 | | -data_test = penguins_test[culmen_columns] |
| 73 | +# %% [markdown] |
| 74 | +# By comparing the cross-validation test scores of both models fold-to-fold, |
| 75 | +# count the number of times the model using both numerical and categorical |
| 76 | +# features has a better test score than the model using only numerical features. |
51 | 77 |
52 | | -target_train = penguins_train[target_column] |
53 | | -target_test = penguins_test[target_column] |
| 78 | +# %% |
| 79 | +# Write your code here. |
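A short sketch of the fold-to-fold comparison, assuming the two result dictionaries are named `cv_results_num` and `cv_results_all` as in the sketches above:

```python
import numpy as np

# Both cross-validations used cv=10 without shuffling, so the folds correspond
# to the same train/test splits and a pairwise comparison is meaningful.
wins = np.sum(cv_results_all["test_score"] > cv_results_num["test_score"])
print(f"The model using all features is better on {wins} out of 10 folds.")
```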
54 | 80 |
55 | 81 | # %% [markdown]
56 | | -# First, let's create our predictive model. |
| 82 | +# For the following questions, you can copy and paste the following snippet to |
| 83 | +# get the feature names from the column transformer here named `preprocessor`. |
| 84 | +# |
| 85 | +# ```python |
| 86 | +# preprocessor.fit(data) |
| 87 | +# feature_names = ( |
| 88 | +# preprocessor.named_transformers_["onehotencoder"].get_feature_names_out( |
| 89 | +# categorical_columns |
| 90 | +# ) |
| 91 | +# ).tolist() |
| 92 | +# feature_names += numerical_columns |
| 93 | +# feature_names |
| 94 | +# ``` |
57 | 95 |
58 | 96 | # %%
59 | | -from sklearn.pipeline import make_pipeline |
60 | | -from sklearn.preprocessing import StandardScaler |
61 | | -from sklearn.linear_model import LogisticRegression |
| 97 | +# Write your code here. |
62 | 98 |
63 | | -logistic_regression = make_pipeline( |
64 | | -    StandardScaler(), LogisticRegression(penalty="l2") |
65 | | -) |
| 99 | +# %% [markdown] |
| 100 | +# Notice that there are as many feature names as coefficients in the last step |
| 101 | +# of your predictive pipeline. |
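A quick check of this claim, assuming the `feature_names` list built with the snippet above and the pipeline called `model_all` in the earlier sketch:

```python
# Fit the full pipeline once on the whole dataset: the number of coefficients
# of its final LogisticRegression step should equal the number of feature names.
model_all.fit(data, target)
print(len(feature_names), model_all[-1].coef_.shape[1])
```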
66 | 102 |
67 | 103 | # %% [markdown]
68 | | -# Given the following candidates for the `C` parameter, find out the impact of |
69 | | -# `C` on the classifier decision boundary. You can use |
70 | | -# `sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the |
71 | | -# decision function boundary. |
| 104 | +# Which of the following pairs of features is most impacting the predictions of |
| 105 | +# the logistic regression classifier based on the absolute magnitude of its |
| 106 | +# coefficients? |
72 | 107 |
73 | 108 | # %%
74 | | -Cs = [0.01, 0.1, 1, 10] |
| 109 | +# Write your code here. |
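One possible sketch for ranking features by influence, assuming `feature_names` and the fitted pipelines stored in `cv_results_all` (illustrative names), and that the column order of the preprocessor matches the order of `feature_names`:

```python
import pandas as pd

# Average the coefficients of the final LogisticRegression step over the folds
# and rank the features by the absolute magnitude of their weight.
coefs = pd.DataFrame(
    [est[-1].coef_[0] for est in cv_results_all["estimator"]],
    columns=feature_names,
)
coefs.abs().mean().sort_values(ascending=False).head(10)
```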
| 110 | + |
| 111 | +# %% [markdown] |
| 112 | +# Now create a similar pipeline consisting of the same preprocessor as above, |
| 113 | +# followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`. |
| 114 | +# Set `degree=2` and `interaction_only=True` to the feature engineering step. |
| 115 | +# Remember not to include a "bias" feature to avoid introducing a redundancy |
| 116 | +# with the intercept of the subsequent logistic regression. |
75 | 117 |
| 118 | +# %% |
76 | 119 | # Write your code here.
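A sketch of one way to assemble this pipeline, reusing the `preprocessor` from the earlier sketch; the names `model_interactions` and `cv_results_interactions` are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same preprocessing, then pairwise interaction terms, then a more regularized
# logistic regression. include_bias=False avoids a constant column that would
# be redundant with the model's intercept.
model_interactions = make_pipeline(
    preprocessor,
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(C=0.01, max_iter=5_000),
)

cv_results_interactions = cross_validate(
    model_interactions, data, target, cv=10, return_estimator=True
)
print(f"Accuracy: {cv_results_interactions['test_score'].mean():.3f}")
```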
77 | 120 |
78 | 121 | # %% [markdown]
79 | | -# Look at the impact of the `C` hyperparameter on the magnitude of the weights. |
| 122 | +# By comparing the cross-validation test scores of both models fold-to-fold, |
| 123 | +# count the number of times the model using multiplicative interactions and both |
| 124 | +# numerical and categorical features has a better test score than the model |
| 125 | +# without interactions. |
| 126 | + |
| 127 | +# %% |
| 128 | +# Write your code here. |
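The same kind of fold-to-fold comparison as before, assuming the results are stored in `cv_results_interactions` and `cv_results_all` (illustrative names):

```python
import numpy as np

# Count the folds where the interaction terms improve the test accuracy.
wins = np.sum(
    cv_results_interactions["test_score"] > cv_results_all["test_score"]
)
print(f"The model with interactions is better on {wins} out of 10 folds.")
```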
80 | 129 |
81 | 130 | # %%
82 | 131 | # Write your code here.