diff --git a/notebooks/datasets_adult_census.ipynb b/notebooks/datasets_adult_census.ipynb index 371525cc2..139287829 100644 --- a/notebooks/datasets_adult_census.ipynb +++ b/notebooks/datasets_adult_census.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# The Adult census dataset\n", + "# The adult census dataset\n", "\n", "[This dataset](http://www.openml.org/d/1590) is a collection of demographic\n", "information for the adult population as of 1994 in the USA. The prediction\n", diff --git a/notebooks/linear_models_ex_03.ipynb b/notebooks/linear_models_ex_03.ipynb index 7ada01f07..29db0d1d4 100644 --- a/notebooks/linear_models_ex_03.ipynb +++ b/notebooks/linear_models_ex_03.ipynb @@ -131,7 +131,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For the following questions, you can copy adn paste the following snippet to\n", + "For the following questions, you can copy and paste the following snippet to\n", "get the feature names from the column transformer here named `preprocessor`.\n", "\n", "```python\n", diff --git a/notebooks/linear_models_sol_02.ipynb b/notebooks/linear_models_sol_02.ipynb index 38ac00ef6..e124537d1 100644 --- a/notebooks/linear_models_sol_02.ipynb +++ b/notebooks/linear_models_sol_02.ipynb @@ -223,9 +223,9 @@ "outputs": [], "source": [ "# solution\n", - "culmen_length_first_sample = 181.0\n", + "flipper_length_first_sample = 181.0\n", "culmen_depth_first_sample = 18.7\n", - "culmen_length_first_sample * culmen_depth_first_sample" + "flipper_length_first_sample * culmen_depth_first_sample" ] }, { diff --git a/notebooks/linear_models_sol_03.ipynb b/notebooks/linear_models_sol_03.ipynb index 20256e76b..ce7e5ace7 100644 --- a/notebooks/linear_models_sol_03.ipynb +++ b/notebooks/linear_models_sol_03.ipynb @@ -211,7 +211,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For the following questions, you can copy adn paste the following snippet to\n", + "For the following questions, you can copy 
and paste the following snippet to\n", "get the feature names from the column transformer here named `preprocessor`.\n", "\n", "```python\n", diff --git a/notebooks/linear_regression_non_linear_link.ipynb b/notebooks/linear_regression_non_linear_link.ipynb index 33f6936cc..9a060a2c0 100644 --- a/notebooks/linear_regression_non_linear_link.ipynb +++ b/notebooks/linear_regression_non_linear_link.ipynb @@ -2,7 +2,6 @@ "cells": [ { "cell_type": "markdown", - "id": "14eec485", "metadata": {}, "source": [ "# Non-linear feature engineering for Linear Regression\n", @@ -25,7 +24,6 @@ { "cell_type": "code", "execution_count": null, - "id": "8f516165", "metadata": {}, "outputs": [], "source": [ @@ -44,13 +42,13 @@ }, { "cell_type": "markdown", - "id": "00fd3b4f", "metadata": {}, "source": [ - "```{tip}\n", - "`np.random.RandomState` allows to create a random number generator which can\n", - "be later used to get deterministic results.\n", - "```\n", + "
\n", + "Tip\n", + "np.random.RandomState allows to create a random number generator which can\n", + "be later used to get deterministic results.\n", + "
\n", "\n", "To ease the plotting, we create a pandas dataframe containing the data and\n", "target:" @@ -59,7 +57,6 @@ { "cell_type": "code", "execution_count": null, - "id": "5459a97b", "metadata": {}, "outputs": [], "source": [ @@ -71,7 +68,6 @@ { "cell_type": "code", "execution_count": null, - "id": "8b1b2257", "metadata": {}, "outputs": [], "source": [ @@ -84,22 +80,21 @@ }, { "cell_type": "markdown", - "id": "be69fae1", "metadata": {}, "source": [ - "```{warning}\n", - "In scikit-learn, by convention `data` (also called `X` in the scikit-learn\n", - "documentation) should be a 2D matrix of shape `(n_samples, n_features)`.\n", - "If `data` is a 1D vector, you need to reshape it into a matrix with a\n", + "
\n", + "Warning\n", + "In scikit-learn, by convention data (also called X in the scikit-learn\n", + "documentation) should be a 2D matrix of shape (n_samples, n_features).\n", + "If data is a 1D vector, you need to reshape it into a matrix with a\n", "single column if the vector represents a feature or a single row if the\n", - "vector represents a sample.\n", - "```" + "vector represents a sample.\n", + "
" ] }, { "cell_type": "code", "execution_count": null, - "id": "46804be9", "metadata": {}, "outputs": [], "source": [ @@ -110,7 +105,6 @@ }, { "cell_type": "markdown", - "id": "a4209f00", "metadata": { "lines_to_next_cell": 2 }, @@ -122,7 +116,6 @@ { "cell_type": "code", "execution_count": null, - "id": "a1bd392b", "metadata": {}, "outputs": [], "source": [ @@ -142,7 +135,6 @@ }, { "cell_type": "markdown", - "id": "7bfcbeb8", "metadata": {}, "source": [ "We now observe the limitations of fitting a linear regression model." @@ -151,7 +143,6 @@ { "cell_type": "code", "execution_count": null, - "id": "1545fec5", "metadata": {}, "outputs": [], "source": [ @@ -165,7 +156,6 @@ { "cell_type": "code", "execution_count": null, - "id": "e8c79631", "metadata": {}, "outputs": [], "source": [ @@ -174,7 +164,6 @@ }, { "cell_type": "markdown", - "id": "545fc1f3", "metadata": {}, "source": [ "Here the coefficient and intercept learnt by `LinearRegression` define the\n", @@ -185,7 +174,6 @@ { "cell_type": "code", "execution_count": null, - "id": "0f95ceef", "metadata": {}, "outputs": [], "source": [ @@ -197,7 +185,6 @@ }, { "cell_type": "markdown", - "id": "1a34a48c", "metadata": {}, "source": [ "Notice that the learnt model cannot handle the non-linear relationship between\n", @@ -217,7 +204,6 @@ { "cell_type": "code", "execution_count": null, - "id": "e01b02d2", "metadata": {}, "outputs": [], "source": [ @@ -230,7 +216,6 @@ { "cell_type": "code", "execution_count": null, - "id": "9a27773e", "metadata": {}, "outputs": [], "source": [ @@ -239,7 +224,6 @@ }, { "cell_type": "markdown", - "id": "4d5070e3", "metadata": {}, "source": [ "Instead of having a model which can natively deal with non-linearity, we could\n", @@ -256,7 +240,6 @@ { "cell_type": "code", "execution_count": null, - "id": "28c13246", "metadata": {}, "outputs": [], "source": [ @@ -266,7 +249,6 @@ { "cell_type": "code", "execution_count": null, - "id": "69d0ba50", "metadata": {}, "outputs": [], "source": [ @@ -276,7 
+258,6 @@ }, { "cell_type": "markdown", - "id": "7925141e", "metadata": {}, "source": [ "Instead of manually creating such polynomial features one could directly use\n", @@ -286,7 +267,6 @@ { "cell_type": "code", "execution_count": null, - "id": "d31ed0f4", "metadata": {}, "outputs": [], "source": [ @@ -297,7 +277,6 @@ }, { "cell_type": "markdown", - "id": "6a7fe453", "metadata": {}, "source": [ "In the previous cell we had to set `include_bias=False` as otherwise we would\n", @@ -312,7 +291,6 @@ }, { "cell_type": "markdown", - "id": "269fbe2b", "metadata": {}, "source": [ "To demonstrate the use of the `PolynomialFeatures` class, we use a\n", @@ -323,7 +301,6 @@ { "cell_type": "code", "execution_count": null, - "id": "38ba0c5c", "metadata": {}, "outputs": [], "source": [ @@ -340,7 +317,6 @@ { "cell_type": "code", "execution_count": null, - "id": "5df7d4a4", "metadata": {}, "outputs": [], "source": [ @@ -349,7 +325,6 @@ }, { "cell_type": "markdown", - "id": "fe259d20", "metadata": {}, "source": [ "We can see that even with a linear model, we can overcome the linearity\n", @@ -379,7 +354,6 @@ { "cell_type": "code", "execution_count": null, - "id": "7d46da9b", "metadata": {}, "outputs": [], "source": [ @@ -392,7 +366,6 @@ { "cell_type": "code", "execution_count": null, - "id": "9406b676", "metadata": {}, "outputs": [], "source": [ @@ -401,7 +374,6 @@ }, { "cell_type": "markdown", - "id": "fd29730e", "metadata": {}, "source": [ "The predictions of our SVR with a linear kernel are all aligned on a straight\n", @@ -419,7 +391,6 @@ { "cell_type": "code", "execution_count": null, - "id": "ae1550fa", "metadata": {}, "outputs": [], "source": [ @@ -430,7 +401,6 @@ { "cell_type": "code", "execution_count": null, - "id": "c4670a4e", "metadata": {}, "outputs": [], "source": [ @@ -439,7 +409,6 @@ }, { "cell_type": "markdown", - "id": "732b2b0f", "metadata": {}, "source": [ "Kernel methods such as SVR are very efficient for small to medium datasets.\n", @@ -460,7 +429,6 @@ { 
"cell_type": "code", "execution_count": null, - "id": "e30e6b37", "metadata": {}, "outputs": [], "source": [ @@ -476,7 +444,6 @@ { "cell_type": "code", "execution_count": null, - "id": "b46eb0ef", "metadata": {}, "outputs": [], "source": [ @@ -486,7 +453,6 @@ { "cell_type": "code", "execution_count": null, - "id": "5403e6b1", "metadata": {}, "outputs": [], "source": [ @@ -502,7 +468,6 @@ { "cell_type": "code", "execution_count": null, - "id": "0dcdfe92", "metadata": {}, "outputs": [], "source": [ @@ -511,7 +476,6 @@ }, { "cell_type": "markdown", - "id": "4b4f0560", "metadata": {}, "source": [ "`Nystroem` is a nice alternative to `PolynomialFeatures` that makes it\n", @@ -523,7 +487,6 @@ { "cell_type": "code", "execution_count": null, - "id": "41d6abd8", "metadata": {}, "outputs": [], "source": [ @@ -539,7 +502,6 @@ { "cell_type": "code", "execution_count": null, - "id": "be6a232c", "metadata": {}, "outputs": [], "source": [ @@ -550,7 +512,6 @@ }, { "cell_type": "markdown", - "id": "7860e12d", "metadata": {}, "source": [ "## Notebook Recap\n", @@ -579,4 +540,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/notebooks/linear_regression_without_sklearn.ipynb b/notebooks/linear_regression_without_sklearn.ipynb index 22707379c..039e1014a 100644 --- a/notebooks/linear_regression_without_sklearn.ipynb +++ b/notebooks/linear_regression_without_sklearn.ipynb @@ -7,8 +7,8 @@ "# Linear regression without scikit-learn\n", "\n", "In this notebook, we introduce linear regression. Before presenting the\n", - "available scikit-learn classes, we will provide some insights with a simple\n", - "example. We will use a dataset that contains measurements taken on penguins." + "available scikit-learn classes, here we provide some insights with a simple\n", + "example. We use a dataset that contains measurements taken on penguins." 
] }, { @@ -38,8 +38,8 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We will formulate the following problem: using the flipper length of a\n", - "penguin, we would like to infer its mass." + "We aim to solve the following problem: using the flipper length of a penguin,\n", + "we would like to infer its mass." ] }, { @@ -110,8 +110,8 @@ "metadata": {}, "source": [ "Using the model we defined above, we can check the body mass values predicted\n", - "for a range of flipper lengths. We will set `weight_flipper_length` to be 45\n", - "and `intercept_body_mass` to be -5000." + "for a range of flipper lengths. We set `weight_flipper_length` and\n", + "`intercept_body_mass` to arbitrary values of 45 and -5000, respectively." ] }, { @@ -159,7 +159,7 @@ "source": [ "The variable `weight_flipper_length` is a weight applied to the feature\n", "`flipper_length` in order to make the inference. When this coefficient is\n", - "positive, it means that penguins with longer flipper lengths will have larger\n", + "positive, it means that penguins with longer flipper lengths have larger\n", "body masses. If the coefficient is negative, it means that penguins with\n", "shorter flipper lengths have larger body masses. Graphically, this coefficient\n", "is represented by the slope of the curve in the plot. Below we show what the\n", @@ -207,7 +207,7 @@ "source": [ "In our case, this coefficient has a meaningful unit: g/mm. For instance, a\n", "coefficient of 40 g/mm, means that for each additional millimeter in flipper\n", - "length, the body weight predicted will increase by 40 g." + "length, the body weight predicted increases by 40 g." ] }, { @@ -238,8 +238,8 @@ "This parameter corresponds to the value on the y-axis if `flipper_length=0`\n", "(which in our case is only a mathematical consideration, as in our data, the\n", " value of `flipper_length` only goes from 170mm to 230mm). This y-value when\n", - "x=0 is called the y-intercept. 
If `intercept_body_mass` is 0, the curve will\n", - "pass through the origin:" + "x=0 is called the y-intercept. If `intercept_body_mass` is 0, the curve passes\n", + "through the origin:" ] }, { @@ -275,7 +275,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Otherwise, it will pass through the `intercept_body_mass` value:" + "Otherwise, it passes through the `intercept_body_mass` value:" ] }, { diff --git a/notebooks/logistic_regression.ipynb b/notebooks/logistic_regression.ipynb index 691283b02..0efd4e0dc 100644 --- a/notebooks/logistic_regression.ipynb +++ b/notebooks/logistic_regression.ipynb @@ -2,7 +2,6 @@ "cells": [ { "cell_type": "markdown", - "id": "b0e67575", "metadata": {}, "source": [ "# Linear models for classification\n", @@ -18,19 +17,18 @@ }, { "cell_type": "markdown", - "id": "ac574018", "metadata": {}, "source": [ - "```{note}\n", - "If you want a deeper overview regarding this dataset, you can refer to the\n", - "Appendix - Datasets description section at the end of this MOOC.\n", - "```" + "
\n", + "Note\n", + "If you want a deeper overview regarding this dataset, you can refer to the\n", + "Appendix - Datasets description section at the end of this MOOC.\n", + "
" ] }, { "cell_type": "code", "execution_count": null, - "id": "a47d670a", "metadata": {}, "outputs": [], "source": [ @@ -48,7 +46,6 @@ }, { "cell_type": "markdown", - "id": "2165fcfc", "metadata": {}, "source": [ "We can quickly start by visualizing the feature distribution by class:" @@ -57,7 +54,6 @@ { "cell_type": "code", "execution_count": null, - "id": "9ac5a70c", "metadata": {}, "outputs": [], "source": [ @@ -72,7 +68,6 @@ }, { "cell_type": "markdown", - "id": "cab96de7", "metadata": {}, "source": [ "We can observe that we have quite a simple problem. When the culmen length\n", @@ -86,7 +81,6 @@ { "cell_type": "code", "execution_count": null, - "id": "b6a3b04c", "metadata": {}, "outputs": [], "source": [ @@ -103,7 +97,6 @@ }, { "cell_type": "markdown", - "id": "4964b148", "metadata": {}, "source": [ "The linear regression that we previously saw predicts a continuous output.\n", @@ -117,7 +110,6 @@ { "cell_type": "code", "execution_count": null, - "id": "47347104", "metadata": {}, "outputs": [], "source": [ @@ -133,7 +125,6 @@ }, { "cell_type": "markdown", - "id": "bafd8265", "metadata": {}, "source": [ "Since we are dealing with a classification problem containing only 2 features,\n", @@ -141,22 +132,21 @@ "the rule used by our predictive model to affect a class label given the\n", "feature values of the sample.\n", "\n", - "```{note}\n", - "Here, we use the class `DecisionBoundaryDisplay`. This educational tool allows\n", + "
\n", + "Note\n", + "Here, we use the class DecisionBoundaryDisplay. This educational tool allows\n", "us to gain some insights by plotting the decision function boundary learned by\n", - "the classifier in a 2 dimensional feature space.\n", - "\n", - "Notice however that in more realistic machine learning contexts, one would\n", + "the classifier in a 2 dimensional feature space.\n", + "Notice however that in more realistic machine learning contexts, one would\n", "typically fit on more than two features at once and therefore it would not be\n", "possible to display such a visualization of the decision boundary in\n", - "general.\n", - "```" + "general.\n", + "
" ] }, { "cell_type": "code", "execution_count": null, - "id": "dd628d44", "metadata": {}, "outputs": [], "source": [ @@ -182,7 +172,6 @@ }, { "cell_type": "markdown", - "id": "dbd93bf3", "metadata": {}, "source": [ "Thus, we see that our decision function is represented by a straight line\n", @@ -208,7 +197,6 @@ { "cell_type": "code", "execution_count": null, - "id": "8c76e56c", "metadata": {}, "outputs": [], "source": [ @@ -219,7 +207,6 @@ }, { "cell_type": "markdown", - "id": "416a9aff", "metadata": {}, "source": [ "You can [access pipeline\n", @@ -236,7 +223,6 @@ { "cell_type": "code", "execution_count": null, - "id": "8c9b19ae", "metadata": {}, "outputs": [], "source": [ @@ -246,7 +232,6 @@ }, { "cell_type": "markdown", - "id": "083d61ff", "metadata": {}, "source": [ "If one of the weights had been zero, the decision boundary would have been\n", @@ -266,7 +251,6 @@ { "cell_type": "code", "execution_count": null, - "id": "d30ac7e5", "metadata": {}, "outputs": [], "source": [ @@ -278,7 +262,6 @@ }, { "cell_type": "markdown", - "id": "6e7141da", "metadata": {}, "source": [ "In this case, our logistic regression classifier predicts the Chinstrap\n", @@ -286,7 +269,7 @@ "coordinates of this test data point match a location close to the decision\n", "boundary, in the red region.\n", "\n", - "As mentioned in the introductory slides 🎥 **Intuitions on linear models**,\n", + "As mentioned in the introductory slides \ud83c\udfa5 **Intuitions on linear models**,\n", "one can alternatively use the `predict_proba` method to compute continuous\n", "values (\"soft predictions\") that correspond to an estimation of the confidence\n", "of the target belonging to each class.\n", @@ -301,7 +284,6 @@ { "cell_type": "code", "execution_count": null, - "id": "f03d6062", "metadata": {}, "outputs": [], "source": [ @@ -311,7 +293,6 @@ }, { "cell_type": "markdown", - "id": "bd3a7c7f", "metadata": {}, "source": [ "More in general, the output of `predict_proba` is an array of shape\n", @@ 
-321,7 +302,6 @@ { "cell_type": "code", "execution_count": null, - "id": "e12bb08c", "metadata": {}, "outputs": [], "source": [ @@ -330,7 +310,6 @@ }, { "cell_type": "markdown", - "id": "67f73ae8", "metadata": {}, "source": [ "Also notice that the sum of (estimated) predicted probabilities across classes\n", @@ -341,7 +320,6 @@ { "cell_type": "code", "execution_count": null, - "id": "427587b6", "metadata": {}, "outputs": [], "source": [ @@ -355,16 +333,16 @@ }, { "cell_type": "markdown", - "id": "053ad22c", "metadata": {}, "source": [ - "```{warning}\n", - "We insist that the output of `predict_proba` are just estimations. Their\n", + "
\n", + "Warning\n", + "We insist that the output of predict_proba are just estimations. Their\n", "reliability on being a good estimate of the true conditional class-assignment\n", "probabilities depends on the quality of the model. Even classifiers with a\n", "high accuracy on a test set may be overconfident for some individuals and\n", - "underconfident for others.\n", - "```\n", + "underconfident for others.\n", + "
\n", "\n", "Similarly to the hard decision boundary shown above, one can set the\n", "`response_method` to `\"predict_proba\"` in the `DecisionBoundaryDisplay` to\n", @@ -384,7 +362,6 @@ { "cell_type": "code", "execution_count": null, - "id": "fbcece8a", "metadata": {}, "outputs": [], "source": [ @@ -407,7 +384,6 @@ }, { "cell_type": "markdown", - "id": "54133c3a", "metadata": {}, "source": [ "For multi-class classification the logistic regression uses the [softmax\n", @@ -432,4 +408,4 @@ }, "nbformat": 4, "nbformat_minor": 5 -} +} \ No newline at end of file diff --git a/notebooks/trees_hyperparameters.ipynb b/notebooks/trees_hyperparameters.ipynb index e60248e94..b9de0ac27 100644 --- a/notebooks/trees_hyperparameters.ipynb +++ b/notebooks/trees_hyperparameters.ipynb @@ -347,14 +347,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As expected, we see that the blue blob on the right and the red blob on the\n", - "top are easily separated. However, more splits will be required to better\n", - "split the blob were both blue and red data points are mixed.\n", - "\n", - "Indeed, we see that red blob on the top and the blue blob on the right of the\n", - "plot are perfectly separated. However, the tree is still making mistakes in\n", - "the area where the blobs are mixed together. Let's check the tree\n", - "representation." + "As expected, we see that the blue blob in the lower right and the red blob on\n", + "the top are easily separated. However, more splits will be required to better\n", + "split the blob were both blue and red data points are mixed." ] }, {