[ci skip] CI Use actions upload-artifact and download-artifact v4 (IN…
lesteve committed Jan 3, 2025
1 parent e704044 commit af63f17
Showing 38 changed files with 131 additions and 124 deletions.
Binary file not shown.
15 changes: 9 additions & 6 deletions _sources/python_scripts/01_tabular_data_exploration.py
Original file line number Diff line number Diff line change
@@ -360,9 +360,12 @@
# We made important observations (which will be discussed later in more detail):
#
# * if your target variable is imbalanced (e.g., you have more samples from one
-# target category than another), you may need special techniques for training
-# and evaluating your machine learning model;
-# * having redundant (or highly correlated) columns can be a problem for some
-# machine learning algorithms;
-# * contrary to decision tree, linear models can only capture linear
-# interactions, so be aware of non-linear relationships in your data.
+# target category than another), you may need to be careful when interpreting
+# the values of performance metrics;
+# * columns can be redundant (or highly correlated), which is not necessarily a
+# problem, but may require special treatment as we will cover in future
+# notebooks;
+# * decision trees create prediction rules by comparing each feature to a
+# threshold value, resulting in decision boundaries that are always parallel
+# to the axes. In 2D, this means the boundaries are vertical or horizontal
+# line segments at the feature threshold values.
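
The first bullet of the rewritten recap can be illustrated with a minimal sketch. It uses made-up labels (not the dataset from the notebook) and two hypothetical helper functions, `accuracy` and `recall`. A baseline that always predicts the majority class reaches a high accuracy while never detecting the minority class, which is why the metric needs careful reading on an imbalanced target.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the samples of class `label` that were predicted as `label`."""
    hits = sum(t == p == label for t, p in zip(y_true, y_pred))
    return hits / sum(t == label for t in y_true)


# Made-up imbalanced target: 90 samples of one class, 10 of the other.
y_true = ["majority"] * 90 + ["minority"] * 10

# Baseline that ignores the features and always predicts the most frequent class.
most_frequent = Counter(y_true).most_common(1)[0][0]
y_pred = [most_frequent] * len(y_true)

baseline_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, "minority")
print(f"accuracy: {baseline_accuracy:.2f}, minority recall: {minority_recall:.2f}")
```

The accuracy alone looks good here only because it mirrors the class proportions; a per-class metric exposes that the baseline learned nothing about the rare class.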
12 changes: 8 additions & 4 deletions _sources/python_scripts/cross_validation_learning_curve.py
Original file line number Diff line number Diff line change
@@ -102,10 +102,14 @@
# benefit to adding samples anymore or assessing the potential gain of adding
# more samples into the training set.
#
-# If we achieve a plateau and adding new samples in the training set does not
-# reduce the testing error, we might have reached the Bayes error rate using the
-# available model. Using a more complex model might be the only possibility to
-# reduce the testing error further.
+# If the testing error plateaus despite adding more training samples, it's
+# possible that the model has achieved its optimal performance. In this case,
+# using a more expressive model might help reduce the error further. Otherwise,
+# the error may have reached the Bayes error rate, the theoretical minimum error
+# due to inherent uncertainty not resolved by the available data. This minimum error is
+# non-zero whenever some of the variation of the target variable `y` depends on
+# external factors not fully observed in the features available in `X`, which is
+# almost always the case in practice.
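
The idea of an irreducible minimum error can be sketched for a regression setting with squared error, where the analogue of the Bayes error is the variance of the unobserved noise. The data-generating function, the noise level and all variable names below are assumptions chosen for illustration. Even an "oracle" that knows the exact noise-free relationship between `x` and `y` cannot bring the error below that variance.

```python
import random

rng = random.Random(0)  # seeded so that the sketch is reproducible
noise_std = 0.5         # assumed noise level, chosen for illustration
n_samples = 20_000

squared_errors = []
for _ in range(n_samples):
    x = rng.uniform(-3, 3)
    y = x**3 + rng.gauss(0, noise_std)  # target = signal + unobserved noise
    oracle_prediction = x**3            # best possible prediction given x only
    squared_errors.append((y - oracle_prediction) ** 2)

oracle_mse = sum(squared_errors) / n_samples
print(f"oracle MSE: {oracle_mse:.3f} (noise variance: {noise_std**2:.3f})")
```

Adding more samples makes the estimate of this error more precise but does not lower it; only observing the factors hidden in the noise term would.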
#
# ## Summary
#
72 changes: 34 additions & 38 deletions _sources/python_scripts/ensemble_bagging.py
Original file line number Diff line number Diff line change
@@ -8,31 +8,28 @@
# %% [markdown]
# # Bagging
#
-# This notebook introduces a very natural strategy to build ensembles of machine
-# learning models named "bagging".
+# In this notebook we introduce a very natural strategy to build ensembles of
+# machine learning models, named "bagging".
#
# "Bagging" stands for Bootstrap AGGregatING. It uses bootstrap resampling
# (random sampling with replacement) to learn several models on random
# variations of the training set. At predict time, the predictions of each
# learner are aggregated to give the final predictions.
#
-# First, we will generate a simple synthetic dataset to get insights regarding
-# bootstraping.
+# We first create a simple synthetic dataset to better understand bootstrapping.

# %%
import pandas as pd
import numpy as np

-# create a random number generator that will be used to set the randomness
-rng = np.random.RandomState(1)
-

 def generate_data(n_samples=30):
     """Generate synthetic dataset. Returns `data_train`, `data_test`,
     `target_train`."""
     x_min, x_max = -3, 3
+    rng = np.random.default_rng(1)  # Create a random number generator
     x = rng.uniform(x_min, x_max, size=n_samples)
-    noise = 4.0 * rng.randn(n_samples)
+    noise = 4.0 * rng.normal(size=(n_samples,))
     y = x**3 - 0.5 * (x + 1) ** 2 + noise
     y /= y.std()

@@ -57,9 +54,8 @@ def generate_data(n_samples=30):

# %% [markdown]
#
-# The relationship between our feature and the target to predict is non-linear.
-# However, a decision tree is capable of approximating such a non-linear
-# dependency:
+# The target to predict is a non-linear function of the only feature. However, a
+# decision tree is capable of approximating such a non-linear dependency:

# %%
from sklearn.tree import DecisionTreeRegressor
@@ -86,23 +82,24 @@ def generate_data(n_samples=30):
#
# ## Bootstrap resampling
#
-# Given a dataset with `n` data points, bootstrapping corresponds to resampling
-# with replacement `n` out of such `n` data points uniformly at random.
+# Bootstrapping involves uniformly resampling `n` data points from a dataset of
+# `n` points, with replacement, ensuring each sample has an equal chance of
+# selection.
#
# As a result, the output of the bootstrap sampling procedure is another dataset
-# with also n data points, but likely with duplicates. As a consequence, there
-# are also data points from the original dataset that are never selected to
-# appear in a bootstrap sample (by chance). Those data points that are left away
-# are often referred to as the out-of-bag sample.
+# with `n` data points, likely containing duplicates. Consequently, some data
+# points from the original dataset may not be selected for a bootstrap sample.
+# These unselected data points are often referred to as the out-of-bag sample.
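
The resampling described in this paragraph can be sketched on plain indices. This is a minimal illustration that uses the standard-library `random` module rather than the NumPy generator from the notebook; the size `n = 10` and the variable names are assumptions.

```python
import random

rng = random.Random(0)  # seeded so that the sketch is reproducible
n = 10
original_indices = list(range(n))

# Sampling with replacement: draw n indices uniformly from the n available ones.
bootstrap_indices = rng.choices(original_indices, k=n)

# Points that were never drawn form the out-of-bag sample.
out_of_bag = sorted(set(original_indices) - set(bootstrap_indices))

print("bootstrap sample:", sorted(bootstrap_indices))
print("out-of-bag:", out_of_bag)
```

The bootstrap sample always has `n` entries, so every duplicate in it corresponds to one original point pushed into the out-of-bag set.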
#
-# We will create a function that given `data` and `target` will return a
+# We now create a function that, given `data` and `target`, returns a
# resampled variation `data_bootstrap` and `target_bootstrap`.


# %%
-def bootstrap_sample(data, target):
+def bootstrap_sample(data, target, seed=0):
     # Indices corresponding to a sampling with replacement of the same sample
     # size than the original data
+    rng = np.random.default_rng(seed)
     bootstrap_indices = rng.choice(
         np.arange(target.shape[0]),
         size=target.shape[0],
@@ -117,7 +114,7 @@ def bootstrap_sample(data, target):

# %% [markdown]
#
-# We will generate 3 bootstrap samples and qualitatively check the difference
+# We generate 3 bootstrap samples and qualitatively check the difference
# with the original dataset.

# %%
@@ -127,6 +124,7 @@ def bootstrap_sample(data, target):
     data_bootstrap, target_bootstrap = bootstrap_sample(
         data_train,
         target_train,
+        seed=bootstrap_idx,  # ensure bootstrap samples are different but reproducible
     )
     plt.figure()
     plt.scatter(
@@ -179,9 +177,9 @@ def bootstrap_sample(data, target):
# %% [markdown]
#
# On average, roughly 63.2% of the original data points of the original dataset
-# will be present in a given bootstrap sample. Since the bootstrap sample has
-# the same size as the original dataset, there will be many samples that are in
-# the bootstrap sample multiple times.
+# are present in a given bootstrap sample. Since the bootstrap sample has the
+# same size as the original dataset, there are many samples that are in the
+# bootstrap sample multiple times.
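
The 63.2% figure follows from a short probability argument, sketched below with a hypothetical helper `expected_unique_fraction`. A given point is missed by a single draw with probability `1 - 1/n`, hence by all `n` independent draws with probability `(1 - 1/n) ** n`, which tends to `1/e` as `n` grows.

```python
import math


def expected_unique_fraction(n):
    """Expected fraction of distinct original points in a bootstrap sample of size n."""
    # Probability that a given point is never selected in n draws with replacement.
    probability_never_drawn = (1 - 1 / n) ** n
    return 1 - probability_never_drawn


for n in (10, 30, 1000):
    print(n, round(expected_unique_fraction(n), 4))

limit = 1 - 1 / math.e  # value approached as n grows
print("limit:", round(limit, 4))
```

For small datasets such as the 30 training points generated earlier in the notebook, the expected fraction is slightly above the limit.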
#
# Using bootstrap we are able to generate many datasets, all slightly different.
# We can fit a decision tree for each of these datasets and they all shall be
Expand All @@ -193,7 +191,7 @@ def bootstrap_sample(data, target):
tree = DecisionTreeRegressor(max_depth=3, random_state=0)

data_bootstrap_sample, target_bootstrap_sample = bootstrap_sample(
data_train, target_train
data_train, target_train, seed=bootstrap_idx
)
tree.fit(data_bootstrap_sample, target_bootstrap_sample)
bag_of_trees.append(tree)
@@ -224,7 +222,7 @@ def bootstrap_sample(data, target):
# %% [markdown]
# ## Aggregating
#
-# Once our trees are fitted, we are able to get predictions for each of them. In
+# Once our trees are fitted, we are able to get predictions from each of them. In
# regression, the most straightforward way to combine those predictions is just
# to average them: for a given test data point, we feed the input feature values
# to each of the `n` trained models in the ensemble and as a result compute `n`
@@ -262,7 +260,7 @@ def bootstrap_sample(data, target):

# %% [markdown]
#
-# The unbroken red line shows the averaged predictions, which would be the final
+# The continuous red line shows the averaged predictions, which would be the final
# predictions given by our 'bag' of decision tree regressors. Note that the
# predictions of the ensemble is more stable because of the averaging operation.
# As a result, the bag of trees as a whole is less likely to overfit than the
@@ -298,7 +296,7 @@ def bootstrap_sample(data, target):
bagged_trees_predictions = bagged_trees.predict(data_test)
plt.plot(data_test["Feature"], bagged_trees_predictions)

-_ = plt.title("Predictions from a bagging classifier")
+_ = plt.title("Predictions from a bagging regressor")

# %% [markdown]
# Because we use 100 trees in the ensemble, the average prediction is indeed
@@ -338,15 +336,14 @@ def bootstrap_sample(data, target):

# %% [markdown]
# We used a low value of the opacity parameter `alpha` to better appreciate the
-# overlap in the prediction functions of the individual trees.
-#
-# This visualization gives some insights on the uncertainty in the predictions
-# in different areas of the feature space.
+# overlap in the prediction functions of the individual trees. Such
+# visualization also gives us an intuition on the variance in the predictions
+# across different zones of the feature space.
#
# ## Bagging complex pipelines
#
-# While we used a decision tree as a base model, nothing prevents us of using
-# any other type of model.
+# Even if here we used a decision tree as a base model, nothing prevents us from
+# using any other type of model.
#
# As we know that the original data generating function is a noisy polynomial
# transformation of the input variable, let us try to fit a bagged polynomial
@@ -361,15 +358,14 @@ def bootstrap_sample(data, target):

 polynomial_regressor = make_pipeline(
     MinMaxScaler(),
-    PolynomialFeatures(degree=4),
+    PolynomialFeatures(degree=4, include_bias=False),
     Ridge(alpha=1e-10),
 )

# %% [markdown]
-# This pipeline first scales the data to the 0-1 range with `MinMaxScaler`. Then
-# it extracts degree-4 polynomial features. The resulting features will all stay
-# in the 0-1 range by construction: if `x` lies in the 0-1 range then `x ** n`
-# also lies in the 0-1 range for any value of `n`.
+# This pipeline first scales the data to the 0-1 range using `MinMaxScaler`. It
+# then generates degree-4 polynomial features. By design, these features remain
+# in the 0-1 range, as any power of `x` within this range also stays within 0-1.
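
The range argument can be checked with a minimal pure-Python sketch. The helpers `min_max_scale` and `polynomial_features` are hypothetical stand-ins that only mimic the scaling and the bias-free power expansion; they are not the scikit-learn implementations, and the feature values are made up.

```python
def min_max_scale(values):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def polynomial_features(x, degree):
    """Powers x**1 .. x**degree, i.e. without the constant (bias) column."""
    return [x**power for power in range(1, degree + 1)]


raw = [-3.0, -1.5, 0.0, 0.75, 3.0]  # made-up feature values
scaled = min_max_scale(raw)
features = [polynomial_features(x, degree=4) for x in scaled]

all_in_unit_range = all(0.0 <= value <= 1.0 for row in features for value in row)
print(all_in_unit_range)
```

The guarantee holds for the values used to compute the minimum and maximum. New data lying outside that range would be scaled outside 0-1, and its powers could then leave the range as well.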
#
# Then the pipeline feeds the resulting non-linear features to a regularized
# linear regression model for the final prediction of the target variable.
Original file line number Diff line number Diff line change
@@ -331,7 +331,7 @@ def plot_decision_boundary(model, title=None):
# from the previous models: its decision boundary can take a diagonal
# direction. Furthermore, we can observe that predictions are very confident in
# the low density regions of the feature space, even very close to the decision
-# boundary
+# boundary.
#
# We can obtain very similar results by using a kernel approximation technique
# such as the Nyström method with a polynomial kernel:
2 changes: 1 addition & 1 deletion _sources/python_scripts/logistic_regression.py
Original file line number Diff line number Diff line change
@@ -151,7 +151,7 @@
# by name or position. In the code above `logistic_regression[-1]` means the
# last step of the pipeline. Then you can access the attributes of that step such
# as `coef_`. Notice also that the `coef_` attribute is an array of shape (1,
-# `n_features`) an then we access it via its first entry. Alternatively one
+# `n_features`) and then we access it via its first entry. Alternatively one
# could use `coef_.ravel()`.
#
# We are now ready to visualize the weight values as a barplot:
4 changes: 3 additions & 1 deletion _sources/python_scripts/metrics_classification.py
Original file line number Diff line number Diff line change
@@ -347,7 +347,9 @@
# of the positive class).

# %%
-prevalence = target_test.value_counts()[1] / target_test.value_counts().sum()
+prevalence = (
+    target_test.value_counts()["donated"] / target_test.value_counts().sum()
+)
print(f"Prevalence of the class 'donated': {prevalence:.2f}")
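
The change above looks up the count by class label instead of by the integer key `1`, which removes any ambiguity between a position and a label. The same computation can be sketched without pandas, using `collections.Counter` on made-up labels; the `"not donated"` class name and the class proportions are assumptions for illustration.

```python
from collections import Counter

# Made-up test labels; only the "donated" name comes from the notebook.
target_test = ["not donated"] * 7 + ["donated"] * 3

counts = Counter(target_test)

# Prevalence = number of positive-class samples / total number of samples.
prevalence = counts["donated"] / sum(counts.values())
print(f"Prevalence of the class 'donated': {prevalence:.2f}")
```

Indexing by the label keeps the result correct even if the ordering of the counted classes changes.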

# %% [markdown]
2 changes: 1 addition & 1 deletion _sources/python_scripts/parameter_tuning_grid_search.py
Original file line number Diff line number Diff line change
@@ -148,7 +148,7 @@
# cross-validation by providing `model_grid_search` as a model to the
# `cross_validate` function.
#
-# Here, we used a single train-test split to to evaluate `model_grid_search`. In
+# Here, we used a single train-test split to evaluate `model_grid_search`. In
# a future notebook will go into more detail about nested cross-validation, when
# you use cross-validation both for hyperparameter tuning and model evaluation.
# ```
2 changes: 1 addition & 1 deletion _sources/python_scripts/parameter_tuning_sol_03.py
Original file line number Diff line number Diff line change
@@ -153,7 +153,7 @@
# holding on any axis of the parallel coordinate plot. You can then slide (move)
# the range selection and cross two selections to see the intersections.
#
-# Selecting the best performing models (i.e. above an accuracy of ~0.68), we
+# Selecting the best performing models (i.e. above R2 score of ~0.68), we
# observe that **in this case**:
#
# - scaling the data is important. All the best performing models use scaled
8 changes: 4 additions & 4 deletions appendix/notebook_timings.html
Original file line number Diff line number Diff line change
@@ -893,9 +893,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/ensemble_bagging.html"><span class="doc">python_scripts/ensemble_bagging</span></a></p></td>
-<td><p>2025-01-02 15:43</p></td>
+<td><p>2025-01-03 05:42</p></td>
 <td><p>cache</p></td>
-<td><p>4.64</p></td>
+<td><p>7.41</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/ensemble_ex_01.html"><span class="doc">python_scripts/ensemble_ex_01</span></a></p></td>
@@ -1085,9 +1085,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/metrics_classification.html"><span class="doc">python_scripts/metrics_classification</span></a></p></td>
-<td><p>2025-01-02 15:51</p></td>
+<td><p>2025-01-03 05:42</p></td>
 <td><p>cache</p></td>
-<td><p>3.21</p></td>
+<td><p>3.11</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/metrics_ex_01.html"><span class="doc">python_scripts/metrics_ex_01</span></a></p></td>
15 changes: 9 additions & 6 deletions python_scripts/01_tabular_data_exploration.html
Original file line number Diff line number Diff line change
@@ -1842,12 +1842,15 @@ <h2>Notebook Recap<a class="headerlink" href="#notebook-recap" title="Permalink
<p>We made important observations (which will be discussed later in more detail):</p>
<ul class="simple">
<li><p>if your target variable is imbalanced (e.g., you have more samples from one
-target category than another), you may need special techniques for training
-and evaluating your machine learning model;</p></li>
-<li><p>having redundant (or highly correlated) columns can be a problem for some
-machine learning algorithms;</p></li>
-<li><p>contrary to decision tree, linear models can only capture linear
-interactions, so be aware of non-linear relationships in your data.</p></li>
+target category than another), you may need to be careful when interpreting
+the values of performance metrics;</p></li>
+<li><p>columns can be redundant (or highly correlated), which is not necessarily a
+problem, but may require special treatment as we will cover in future
+notebooks;</p></li>
+<li><p>decision trees create prediction rules by comparing each feature to a
+threshold value, resulting in decision boundaries that are always parallel
+to the axes. In 2D, this means the boundaries are vertical or horizontal
+line segments at the feature threshold values.</p></li>
</ul>
</section>
</section>
12 changes: 8 additions & 4 deletions python_scripts/cross_validation_learning_curve.html
Original file line number Diff line number Diff line change
@@ -808,10 +808,14 @@ <h2>Learning curve<a class="headerlink" href="#learning-curve" title="Permalink
are searching for the plateau of the testing error for which there is no
benefit to adding samples anymore or assessing the potential gain of adding
more samples into the training set.</p>
-<p>If we achieve a plateau and adding new samples in the training set does not
-reduce the testing error, we might have reached the Bayes error rate using the
-available model. Using a more complex model might be the only possibility to
-reduce the testing error further.</p>
+<p>If the testing error plateaus despite adding more training samples, it’s
+possible that the model has achieved its optimal performance. In this case,
+using a more expressive model might help reduce the error further. Otherwise,
+the error may have reached the Bayes error rate, the theoretical minimum error
+due to inherent uncertainty not resolved by the available data. This minimum error is
+non-zero whenever some of the variation of the target variable <code class="docutils literal notranslate"><span class="pre">y</span></code> depends on
+external factors not fully observed in the features available in <code class="docutils literal notranslate"><span class="pre">X</span></code>, which is
+almost always the case in practice.</p>
</section>
<section id="summary">
<h2>Summary<a class="headerlink" href="#summary" title="Permalink to this heading">#</a></h2>