INRIA · ArturoAmorQ · Oct 27, 2023 · Oct 26, 2023 · Oct 26, 2023 · Oct 27, 2023
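Reviewer sketch, not part of the patch: the first task in the rewritten exercise (a `StandardScaler` plus `LogisticRegression` pipeline, evaluated with 10-fold cross-validation and `return_estimator=True`) might look roughly like the snippet below. It assumes a synthetic stand-in built with `make_classification` in place of `../datasets/adult-census.csv`, so the `feature_*` column names are hypothetical and the printed numbers say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic data replaces the numerical Adult Census columns,
# because the CSV is not available here. Column names are hypothetical.
X, y = make_classification(
    n_samples=500,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    random_state=0,
)
data = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
target = pd.Series(y, name="class")

# Scale the numerical features, then fit a logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression())

# return_estimator=True keeps the fitted pipeline of each fold for inspection.
cv_results = cross_validate(model, data, target, cv=10, return_estimator=True)

# One row of coefficients per fold; the last pipeline step is the classifier.
coefs = pd.DataFrame(
    [estimator[-1].coef_[0] for estimator in cv_results["estimator"]],
    columns=data.columns,
)

# Rank features by the mean absolute value of their coefficients across folds.
mean_abs_coefs = coefs.abs().mean().sort_values(ascending=False)

print(f"Mean accuracy: {cv_results['test_score'].mean():.3f}")
print(mean_abs_coefs)
```

With matplotlib installed, `coefs.abs().plot.box(vert=False)` would give the boxplot the exercise suggests, showing the per-fold variability of each coefficient.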
diff --git a/notebooks/linear_models_ex_03.ipynb b/notebooks/linear_models_ex_03.ipynb
@@ -6,25 +6,36 @@
    "source": [
     "# \ud83d\udcdd Exercise M4.03\n",
     "\n",
-    "The parameter `penalty` can control the **type** of regularization to use,\n",
-    "whereas the regularization **strength** is set using the parameter `C`.\n",
-    "Setting`penalty=\"none\"` is equivalent to an infinitely large value of `C`. In\n",
-    "this exercise, we ask you to train a logistic regression classifier using the\n",
-    "`penalty=\"l2\"` regularization (which happens to be the default in\n",
-    "scikit-learn) to find by yourself the effect of the parameter `C`.\n",
+    "Now, we tackle a more realistic classification problem instead of making a\n",
+    "synthetic dataset. We start by loading the Adult Census dataset with the\n",
+    "following snippet. For the moment we retain only the **numerical features**."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
     "\n",
-    "We start by loading the dataset."
+    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
+    "target = adult_census[\"class\"]\n",
+    "data = adult_census.select_dtypes([\"integer\", \"floating\"])\n",
+    "data = data.drop(columns=[\"education-num\"])\n",
+    "data"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "<div class=\"admonition note alert alert-info\">\n",
-    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
-    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
-    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
-    "</div>"
+    "We confirm that all the selected features are numerical.\n",
+    "\n",
+    "Compute the generalization performance in terms of accuracy of a linear model\n",
+    "composed of a `StandardScaler` and a `LogisticRegression`. Use a 10-fold\n",
+    "cross-validation with `return_estimator=True` to be able to inspect the\n",
+    "trained estimators."
    ]
   },
   {
@@ -33,16 +44,17 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "import pandas as pd\n",
-    "\n",
-    "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
-    "# only keep the Adelie and Chinstrap classes\n",
-    "penguins = (\n",
-    "    penguins.set_index(\"Species\").loc[[\"Adelie\", \"Chinstrap\"]].reset_index()\n",
-    ")\n",
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "What is the most important feature seen by the logistic regression?\n",
     "\n",
-    "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
-    "target_column = \"Species\""
+    "You can use a boxplot to compare the absolute values of the coefficients while\n",
+    "also visualizing the variability induced by the cross-validation resampling."
    ]
   },
   {
@@ -51,22 +63,15 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from sklearn.model_selection import train_test_split\n",
-    "\n",
-    "penguins_train, penguins_test = train_test_split(penguins, random_state=0)\n",
-    "\n",
-    "data_train = penguins_train[culmen_columns]\n",
-    "data_test = penguins_test[culmen_columns]\n",
-    "\n",
-    "target_train = penguins_train[target_column]\n",
-    "target_test = penguins_test[target_column]"
+    "# Write your code here."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "First, let's create our predictive model."
+    "Let's now work with **both numerical and categorical features**. You can\n",
+    "reload the Adult Census dataset with the following snippet:"
    ]
   },
   {
@@ -75,23 +80,42 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "from sklearn.pipeline import make_pipeline\n",
-    "from sklearn.preprocessing import StandardScaler\n",
-    "from sklearn.linear_model import LogisticRegression\n",
+    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")\n",
+    "target = adult_census[\"class\"]\n",
+    "data = adult_census.drop(columns=[\"class\", \"education-num\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Create a predictive model where:\n",
+    "- The numerical data must be scaled.\n",
+    "- The categorical data must be one-hot encoded; set `min_frequency=0.01` to\n",
+    "  group categories that account for less than 1% of the total samples.\n",
+    "- The predictor is a `LogisticRegression`. You may need to increase\n",
+    "  `max_iter`, which is 100 by default.\n",
     "\n",
-    "logistic_regression = make_pipeline(\n",
-    "    StandardScaler(), LogisticRegression(penalty=\"l2\")\n",
-    ")"
+    "Use the same 10-fold cross-validation strategy with `return_estimator=True` as\n",
+    "above to evaluate this complex pipeline."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Given the following candidates for the `C` parameter, find out the impact of\n",
-    "`C` on the classifier decision boundary. You can use\n",
-    "`sklearn.inspection.DecisionBoundaryDisplay.from_estimator` to plot the\n",
-    "decision function boundary."
+    "By comparing the cross-validation test scores of both models fold-to-fold,\n",
+    "count the number of times the model using both numerical and categorical\n",
+    "features has a better test score than the model using only numerical features."
    ]
   },
   {
@@ -100,16 +124,100 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "Cs = [0.01, 0.1, 1, 10]\n",
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For the following questions, you can copy and paste the following snippet to\n",
+    "get the feature names from the column transformer here named `preprocessor`.\n",
     "\n",
+    "```python\n",
+    "preprocessor.fit(data)\n",
+    "feature_names = (\n",
+    "    preprocessor.named_transformers_[\"onehotencoder\"].get_feature_names_out(\n",
+    "        categorical_columns\n",
+    "    )\n",
+    ").tolist()\n",
+    "feature_names += numerical_columns\n",
+    "feature_names\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Notice that there are as many feature names as coefficients in the last step\n",
+    "of your predictive pipeline."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Which of the following pairs of features has the largest impact on the\n",
+    "predictions of the logistic regression classifier, based on the absolute\n",
+    "magnitude of its coefficients?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "# Write your code here."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Look at the impact of the `C` hyperparameter on the magnitude of the weights."
+    "Now create a similar pipeline consisting of the same preprocessor as above,\n",
+    "followed by a `PolynomialFeatures` and a logistic regression with `C=0.01`.\n",
+    "Set `degree=2` and `interaction_only=True` in the feature engineering step.\n",
+    "Remember not to include a \"bias\" feature to avoid introducing a redundancy\n",
+    "with the intercept of the subsequent logistic regression."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By comparing the cross-validation test scores of both models fold-to-fold,\n",
+    "count the number of times the model using multiplicative interactions and both\n",
+    "numerical and categorical features has a better test score than the model\n",
+    "without interactions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Write your code here."
    ]
   },
   {