[ci skip] MAINT Fix typos and wording across the mooc (#764)
Co-authored-by: ArturoAmorQ <[email protected]> 1237225
glemaitre committed Apr 26, 2024
1 parent e60b9bd commit 3ac8ce4
Showing 19 changed files with 320 additions and 304 deletions.
Original file line number Diff line number Diff line change
@@ -59,7 +59,7 @@
data

# %% [markdown]
-# We can now linger on the variables, also denominated features, that we later
+# We can now focus on the variables, also denominated features, that we later
# use to build our predictive model. In addition, we can also check how many
# samples are available in our dataset.

9 changes: 5 additions & 4 deletions _sources/python_scripts/03_categorical_pipeline.py
@@ -253,7 +253,7 @@
# and check the generalization performance of this machine learning pipeline using
# cross-validation.
#
-# Before we create the pipeline, we have to linger on the `native-country`.
+# Before we create the pipeline, we have to focus on the `native-country`.
# Let's recall some statistics regarding this column.

# %%
@@ -329,9 +329,10 @@
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")

# %% [markdown]
-# As you can see, this representation of the categorical variables is
-# slightly more predictive of the revenue than the numerical variables
-# that we used previously.
+# As you can see, this representation of the categorical variables is slightly
+# more predictive of the revenue than the numerical variables that we used
+# previously. The reason being that we have more (predictive) categorical
+# features than numerical ones.
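As an aside, the one-hot representation this hunk refers to can be sketched in a few lines of plain Python. This is a simplified stand-in for scikit-learn's `OneHotEncoder`, not the course's code:

```python
def one_hot(values):
    # Map each category to its own binary column, as OneHotEncoder does;
    # this sketch ignores unknown-category handling and sparse output.
    categories = sorted(set(values))
    return [[1 if value == category else 0 for category in categories]
            for value in values]

columns = one_hot(["blue", "red", "blue"])
print(columns)  # [[1, 0], [0, 1], [1, 0]] -- categories are ['blue', 'red']
```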

# %% [markdown]
#
2 changes: 1 addition & 1 deletion _sources/python_scripts/cross_validation_train_test.py
@@ -12,7 +12,7 @@
# of predictive models. While this section could be slightly redundant, we
# intend to go into details into the cross-validation framework.
#
-# Before we dive in, let's linger on the reasons for always having training and
+# Before we dive in, let's focus on the reasons for always having training and
# testing sets. Let's first look at the limitation of using a dataset without
# keeping any samples out.
#
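To make the argument concrete, here is a minimal, hypothetical illustration (not part of the commit) of why evaluating on the training data is misleading: a model that simply memorizes its training set scores perfectly on it, regardless of how well it generalizes.

```python
def predict_memorized(train_X, train_y, x):
    # A 1-nearest-neighbor "model" that memorizes the training set.
    closest = min(range(len(train_X)), key=lambda i: abs(train_X[i] - x))
    return train_y[closest]

train_X = [0.0, 1.0, 2.0, 3.0]
train_y = [0, 1, 0, 1]

train_accuracy = sum(
    predict_memorized(train_X, train_y, x) == y
    for x, y in zip(train_X, train_y)
) / len(train_X)
print(train_accuracy)  # 1.0 -- perfect on its own training data, which says
                       # nothing about new samples; hence the held-out test set
```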
7 changes: 7 additions & 0 deletions _sources/python_scripts/ensemble_sol_02.py
@@ -103,3 +103,10 @@

plt.plot(data_range[feature_name], forest_predictions, label="Random forest")
_ = plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")

+# %% [markdown] tags=["solution"]
+# The random forest reduces the overfitting of the individual trees but still
+# overfits itself. In the section on "hyperparameter tuning with ensemble
+# methods" we will see how to further mitigate this effect. Still, interested
+# users may increase the number of estimators in the forest and try different
+# values of, e.g., `min_samples_split`.
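The point that the forest still overfits can be illustrated with a toy simulation (an assumption-laden sketch, not the course's code): averaging cancels tree-specific randomness, but not the noise all trees share from being fit on the same training sample.

```python
import random

random.seed(0)
TRUE_VALUE = 1.0

def simulate_forest(n_estimators):
    # "shared" mimics noise common to every tree (same training sample);
    # "own" mimics per-tree randomness, which averaging cancels out.
    shared = random.uniform(-0.5, 0.5)
    predictions = [
        TRUE_VALUE + shared + random.uniform(-0.5, 0.5)
        for _ in range(n_estimators)
    ]
    return sum(predictions) / n_estimators

errors = [abs(simulate_forest(200) - TRUE_VALUE) for _ in range(2000)]
mean_error = sum(errors) / len(errors)
print(mean_error > 0.1)  # True: the residual error stays at the scale of the
                         # shared noise, no matter how many trees we average
```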
2 changes: 1 addition & 1 deletion _sources/python_scripts/linear_models_ex_04.py
@@ -17,7 +17,7 @@
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
-# metioned that a small `C` provides a more regularized model, whereas a
+# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
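A hypothetical numeric sketch (not from the commit) of this inverse relationship: the `Ridge` penalty `alpha * ||w||^2` matches the L2 penalty weight implied by `C` when `alpha = 1 / C`. The exact loss scaling in scikit-learn differs slightly, so treat this only as an intuition aid.

```python
def ridge_penalty(weights, alpha):
    # Ridge adds alpha * ||w||^2 to the data-fit term.
    return alpha * sum(w ** 2 for w in weights)

def logreg_penalty(weights, C):
    # LogisticRegression scales the data-fit term by C, which is
    # equivalent to an L2 penalty weight of 1 / C.
    return (1.0 / C) * sum(w ** 2 for w in weights)

weights = [0.5, -1.2, 2.0]
alpha = 4.0
print(ridge_penalty(weights, alpha) == logreg_penalty(weights, C=1.0 / alpha))
# True: a small C (here C = 0.25) means strong regularization, like a large alpha
```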
2 changes: 1 addition & 1 deletion _sources/python_scripts/linear_models_sol_04.py
@@ -11,7 +11,7 @@
# In the previous Module we tuned the hyperparameter `C` of the logistic
# regression without mentioning that it controls the regularization strength.
# Later, on the slides on 🎥 **Intuitions on regularized linear models** we
-# metioned that a small `C` provides a more regularized model, whereas a
+# mentioned that a small `C` provides a more regularized model, whereas a
# non-regularized model is obtained with an infinitely large value of `C`.
# Indeed, `C` behaves as the inverse of the `alpha` coefficient in the `Ridge`
# model.
5 changes: 3 additions & 2 deletions _sources/python_scripts/metrics_regression.py
@@ -97,8 +97,9 @@
# %% [markdown]
# The $R^2$ score represents the proportion of variance of the target that is
# explained by the independent variables in the model. The best score possible
-# is 1 but there is no lower bound. However, a model that predicts the expected
-# value of the target would get a score of 0.
+# is 1 but there is no lower bound. However, a model that predicts the [expected
+# value](https://en.wikipedia.org/wiki/Expected_value) of the target would get a
+# score of 0.
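To make the score-of-0 claim concrete, here is a small hand-rolled computation (a sketch mirroring what `r2_score` and the `DummyRegressor` below compute):

```python
def r2_score(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the target mean.
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
mean_prediction = [sum(y) / len(y)] * len(y)  # what DummyRegressor(strategy="mean") predicts
print(r2_score(y, mean_prediction))  # 0.0: predicting the expected value
print(r2_score(y, [9.0, -9.0, 9.0, -9.0]))  # strongly negative: no lower bound
```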

# %%
from sklearn.dummy import DummyRegressor
52 changes: 26 additions & 26 deletions appendix/notebook_timings.html
@@ -668,9 +668,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
</thead>
<tbody>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/01_tabular_data_exploration.html"><span class="doc">python_scripts/01_tabular_data_exploration</span></a></p></td>
-<td><p>2024-04-26 13:50</p></td>
+<td><p>2024-04-26 13:51</p></td>
 <td><p>cache</p></td>
-<td><p>8.22</p></td>
+<td><p>7.83</p></td>
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/01_tabular_data_exploration_ex_01.html"><span class="doc">python_scripts/01_tabular_data_exploration_ex_01</span></a></p></td>
@@ -704,15 +704,15 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/02_numerical_pipeline_hands_on.html"><span class="doc">python_scripts/02_numerical_pipeline_hands_on</span></a></p></td>
-<td><p>2024-04-26 13:50</p></td>
+<td><p>2024-04-26 13:51</p></td>
 <td><p>cache</p></td>
-<td><p>2.01</p></td>
+<td><p>1.98</p></td>
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/02_numerical_pipeline_introduction.html"><span class="doc">python_scripts/02_numerical_pipeline_introduction</span></a></p></td>
-<td><p>2024-04-26 13:50</p></td>
+<td><p>2024-04-26 13:51</p></td>
 <td><p>cache</p></td>
-<td><p>4.8</p></td>
+<td><p>5.06</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/02_numerical_pipeline_scaling.html"><span class="doc">python_scripts/02_numerical_pipeline_scaling</span></a></p></td>
@@ -734,15 +734,15 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/03_categorical_pipeline.html"><span class="doc">python_scripts/03_categorical_pipeline</span></a></p></td>
-<td><p>2024-04-26 13:50</p></td>
+<td><p>2024-04-26 13:51</p></td>
 <td><p>cache</p></td>
-<td><p>2.8</p></td>
+<td><p>3.06</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/03_categorical_pipeline_column_transformer.html"><span class="doc">python_scripts/03_categorical_pipeline_column_transformer</span></a></p></td>
-<td><p>2024-04-26 13:50</p></td>
+<td><p>2024-04-26 13:52</p></td>
 <td><p>cache</p></td>
-<td><p>4.23</p></td>
+<td><p>4.42</p></td>
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/03_categorical_pipeline_ex_01.html"><span class="doc">python_scripts/03_categorical_pipeline_ex_01</span></a></p></td>
@@ -836,9 +836,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/cross_validation_train_test.html"><span class="doc">python_scripts/cross_validation_train_test</span></a></p></td>
-<td><p>2024-04-26 13:51</p></td>
+<td><p>2024-04-26 13:52</p></td>
 <td><p>cache</p></td>
-<td><p>10.87</p></td>
+<td><p>11.39</p></td>
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/cross_validation_validation_curve.html"><span class="doc">python_scripts/cross_validation_validation_curve</span></a></p></td>
@@ -1004,9 +1004,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_models_ex_02.html"><span class="doc">python_scripts/linear_models_ex_02</span></a></p></td>
-<td><p>2024-04-26 13:51</p></td>
+<td><p>2024-04-26 13:52</p></td>
 <td><p>cache</p></td>
-<td><p>1.09</p></td>
+<td><p>1.17</p></td>
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_models_ex_03.html"><span class="doc">python_scripts/linear_models_ex_03</span></a></p></td>
@@ -1040,9 +1040,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_models_sol_02.html"><span class="doc">python_scripts/linear_models_sol_02</span></a></p></td>
-<td><p>2024-04-26 13:51</p></td>
+<td><p>2024-04-26 13:52</p></td>
 <td><p>cache</p></td>
-<td><p>6.1</p></td>
+<td><p>6.45</p></td>
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_models_sol_03.html"><span class="doc">python_scripts/linear_models_sol_03</span></a></p></td>
@@ -1070,9 +1070,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_regression_without_sklearn.html"><span class="doc">python_scripts/linear_regression_without_sklearn</span></a></p></td>
-<td><p>2024-04-26 13:51</p></td>
+<td><p>2024-04-26 13:52</p></td>
 <td><p>cache</p></td>
-<td><p>2.65</p></td>
+<td><p>2.99</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/logistic_regression.html"><span class="doc">python_scripts/logistic_regression</span></a></p></td>
@@ -1130,15 +1130,15 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/parameter_tuning_grid_search.html"><span class="doc">python_scripts/parameter_tuning_grid_search</span></a></p></td>
-<td><p>2024-04-26 13:51</p></td>
+<td><p>2024-04-26 13:52</p></td>
 <td><p>cache</p></td>
-<td><p>10.21</p></td>
+<td><p>10.5</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/parameter_tuning_manual.html"><span class="doc">python_scripts/parameter_tuning_manual</span></a></p></td>
-<td><p>2024-04-26 13:51</p></td>
+<td><p>2024-04-26 13:52</p></td>
 <td><p>cache</p></td>
-<td><p>4.17</p></td>
+<td><p>4.45</p></td>
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/parameter_tuning_nested.html"><span class="doc">python_scripts/parameter_tuning_nested</span></a></p></td>
@@ -1154,9 +1154,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/parameter_tuning_randomized_search.html"><span class="doc">python_scripts/parameter_tuning_randomized_search</span></a></p></td>
-<td><p>2024-04-26 13:51</p></td>
+<td><p>2024-04-26 13:53</p></td>
 <td><p>cache</p></td>
-<td><p>24.21</p></td>
+<td><p>22.88</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/parameter_tuning_sol_02.html"><span class="doc">python_scripts/parameter_tuning_sol_02</span></a></p></td>
@@ -1178,9 +1178,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/trees_dataset.html"><span class="doc">python_scripts/trees_dataset</span></a></p></td>
-<td><p>2024-04-26 13:51</p></td>
+<td><p>2024-04-26 13:53</p></td>
 <td><p>cache</p></td>
-<td><p>2.75</p></td>
+<td><p>3.06</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/trees_ex_01.html"><span class="doc">python_scripts/trees_ex_01</span></a></p></td>
2 changes: 1 addition & 1 deletion python_scripts/02_numerical_pipeline_introduction.html
@@ -1003,7 +1003,7 @@ <h2>Separate the data and the target<a class="headerlink" href="#separate-the-da
<p>39073 rows × 4 columns</p>
</div></div></div>
</div>
-<p>We can now linger on the variables, also denominated features, that we later
+<p>We can now focus on the variables, also denominated features, that we later
use to build our predictive model. In addition, we can also check how many
samples are available in our dataset.</p>
<div class="cell docutils container">
13 changes: 7 additions & 6 deletions python_scripts/03_categorical_pipeline.html
@@ -1958,7 +1958,7 @@ <h2>Evaluate our predictive pipeline<a class="headerlink" href="#evaluate-our-pr
did with numerical data: let’s train a linear classifier on the encoded data
and check the generalization performance of this machine learning pipeline using
cross-validation.</p>
-<p>Before we create the pipeline, we have to linger on the <code class="docutils literal notranslate"><span class="pre">native-country</span></code>.
+<p>Before we create the pipeline, we have to focus on the <code class="docutils literal notranslate"><span class="pre">native-country</span></code>.
Let’s recall some statistics regarding this column.</p>
<div class="cell docutils container">
<div class="cell_input docutils container">
@@ -2078,8 +2078,8 @@ <h2>Evaluate our predictive pipeline<a class="headerlink" href="#evaluate-our-pr
</div>
</div>
<div class="cell_output docutils container">
-<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>{&#39;fit_time&#39;: array([0.18181372, 0.16217852, 0.17221045, 0.17812014, 0.16503692]),
-&#39;score_time&#39;: array([0.02207351, 0.02198744, 0.02215958, 0.02402902, 0.02280927]),
+<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>{&#39;fit_time&#39;: array([0.18064904, 0.16906261, 0.17876267, 0.20569158, 0.17099452]),
+&#39;score_time&#39;: array([0.02239752, 0.0232501 , 0.02577472, 0.02373815, 0.02260184]),
&#39;test_score&#39;: array([0.83232675, 0.83570478, 0.82831695, 0.83292383, 0.83497133])}
</pre></div>
</div>
@@ -2098,9 +2098,10 @@ <h2>Evaluate our predictive pipeline<a class="headerlink" href="#evaluate-our-pr
</div>
</div>
</div>
-<p>As you can see, this representation of the categorical variables is
-slightly more predictive of the revenue than the numerical variables
-that we used previously.</p>
+<p>As you can see, this representation of the categorical variables is slightly
+more predictive of the revenue than the numerical variables that we used
+previously. The reason being that we have more (predictive) categorical
+features than numerical ones.</p>
<p>In this notebook we have:</p>
<ul class="simple">
<li><p>seen two common strategies for encoding categorical features: <strong>ordinal
10 changes: 5 additions & 5 deletions python_scripts/03_categorical_pipeline_column_transformer.html
@@ -1571,8 +1571,8 @@ <h2>Evaluation of the model with cross-validation<a class="headerlink" href="#ev
</div>
</div>
<div class="cell_output docutils container">
-<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>{&#39;fit_time&#39;: array([0.24788404, 0.24504352, 0.22222066, 0.23252439, 0.26179743]),
-&#39;score_time&#39;: array([0.02705979, 0.0278194 , 0.02626395, 0.02863431, 0.02582169]),
+<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>{&#39;fit_time&#39;: array([0.25630689, 0.26000094, 0.22319031, 0.24449325, 0.26766682]),
+&#39;score_time&#39;: array([0.02926874, 0.02974772, 0.02790833, 0.02923989, 0.02736449]),
&#39;test_score&#39;: array([0.85116184, 0.84993346, 0.8482801 , 0.85257985, 0.85544636])}
</pre></div>
</div>
@@ -1644,8 +1644,8 @@ <h2>Fitting a more powerful model<a class="headerlink" href="#fitting-a-more-pow
</div>
</div>
<div class="cell_output docutils container">
-<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>CPU times: user 657 ms, sys: 15.8 ms, total: 672 ms
-Wall time: 672 ms
+<div class="output stream highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>CPU times: user 680 ms, sys: 12 ms, total: 692 ms
+Wall time: 692 ms
</pre></div>
</div>
</div>
@@ -1657,7 +1657,7 @@ <h2>Fitting a more powerful model<a class="headerlink" href="#fitting-a-more-pow
</div>
</div>
<div class="cell_output docutils container">
-<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>0.881008926377856
+<div class="output text_plain highlight-myst-ansi notranslate"><div class="highlight"><pre><span></span>0.8805994595037262
</pre></div>
</div>
</div>
