
Commit 3083498

Author: ArturoAmorQ
ENH Rework narrative of GBDT notebook
1 parent 8124c5b · commit 3083498

1 file changed: +103 -106 lines changed

python_scripts/ensemble_gradient_boosting.py

@@ -8,26 +8,28 @@
 # %% [markdown]
 # # Gradient-boosting decision tree (GBDT)
 #
-# In this notebook, we will present the gradient boosting decision tree
-# algorithm and contrast it with AdaBoost.
+# In this notebook, we present the gradient boosting decision tree algorithm.
 #
-# Gradient-boosting differs from AdaBoost due to the following reason: instead
-# of assigning weights to specific samples, GBDT will fit a decision tree on the
-# residuals error (hence the name "gradient") of the previous tree. Therefore,
-# each new tree in the ensemble predicts the error made by the previous learner
-# instead of predicting the target directly.
+# Even if AdaBoost and GBDT are both boosting algorithms, they are different in
+# nature: the former assigns weights to specific samples, whereas GBDT fits
+# successive decision trees on the residual errors (hence the name "gradient")
+# of their preceding tree. Therefore, each new tree in the ensemble tries to
+# refine its predictions by specifically addressing the errors made by the
+# previous learner, instead of predicting the target directly.
 #
-# In this section, we will provide some intuition about the way learners are
-# combined to give the final prediction. In this regard, let's go back to our
-# regression problem which is more intuitive for demonstrating the underlying
+# In this section, we provide some intuition about the way learners are combined
+# to give the final prediction. For this purpose, we tackle a single-feature
+# regression problem, which is more intuitive for demonstrating the underlying
 # machinery.
+#
+# Later in this notebook we compare the performance of GBDT (boosting) with that
+# of a Random Forest (bagging) on a particular dataset.

 # %%
 import pandas as pd
 import numpy as np

-# Create a random number generator that will be used to set the randomness
-rng = np.random.RandomState(0)
+rng = np.random.RandomState(0)  # Create a random number generator


 def generate_data(n_samples=50):
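
Note (illustrative, not part of this commit): the residual-fitting idea described in the reworked markdown above can be sketched with a self-contained toy example. The data and the `max_depth` value below are made up for the illustration; only the mechanism (fit a second tree on the first tree's residuals, then sum the two predictions) mirrors the notebook's narrative.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Toy single-feature regression problem (made up for this sketch)
    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(-1, 1, size=(50, 1)), axis=0)
    y = np.sin(3 * X.ravel()) + 0.1 * rng.randn(50)

    # A shallow first tree underfits and leaves residual errors
    first_tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
    residuals = y - first_tree.predict(X)

    # The second tree is fitted on those residuals, not on the target
    second_tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

    # The boosted prediction is the sum of the two trees' predictions
    boosted = first_tree.predict(X) + second_tree.predict(X)
    print(f"Training MSE, first tree only: {np.mean((y - first_tree.predict(X)) ** 2):.4f}")
    print(f"Training MSE, two trees:       {np.mean((y - boosted) ** 2):.4f}")
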
@@ -60,9 +62,9 @@ def generate_data(n_samples=50):
 _ = plt.title("Synthetic regression dataset")

 # %% [markdown]
-# As we previously discussed, boosting will be based on assembling a sequence of
-# learners. We will start by creating a decision tree regressor. We will set the
-# depth of the tree so that the resulting learner will underfit the data.
+# As we previously discussed, boosting is based on assembling a sequence of
+# learners. We start by creating a decision tree regressor. We set the depth of
+# the tree to underfit the data on purpose.

 # %%
 from sklearn.tree import DecisionTreeRegressor
@@ -74,45 +76,61 @@ def generate_data(n_samples=50):
 target_test_predicted = tree.predict(data_test)

 # %% [markdown]
-# Using the term "test" here refers to data that was not used for training. It
-# should not be confused with data coming from a train-test split, as it was
-# generated in equally-spaced intervals for the visual evaluation of the
-# predictions.
+# Using the term "test" here refers to data not used for training. It should not
+# be confused with data coming from a train-test split, as it was generated in
+# equally-spaced intervals for the visual evaluation of the predictions.
+#
+# To avoid writing the same code in multiple places we define a helper function
+# to plot the data samples as well as the decision tree predictions and
+# residuals.
+

 # %%
-# plot the data
-sns.scatterplot(
-    x=data_train["Feature"], y=target_train, color="black", alpha=0.5
-)
-# plot the predictions
-line_predictions = plt.plot(data_test["Feature"], target_test_predicted, "--")
+def plot_decision_tree_with_residuals(y_train, y_train_pred, y_test_pred):
+    # Create a plot and get the Axes object
+    fig, ax = plt.subplots()
+    # plot the data
+    sns.scatterplot(
+        x=data_train["Feature"], y=y_train, color="black", alpha=0.5, ax=ax
+    )
+    # plot the predictions
+    line_predictions = ax.plot(data_test["Feature"], y_test_pred, "--")
+
+    # plot the residuals
+    for value, true, predicted in zip(
+        data_train["Feature"], y_train, y_train_pred
+    ):
+        lines_residuals = ax.plot(
+            [value, value], [true, predicted], color="red"
+        )
+
+    handles = [line_predictions[0], lines_residuals[0]]
+
+    return handles, ax

-# plot the residuals
-for value, true, predicted in zip(
-    data_train["Feature"], target_train, target_train_predicted
-):
-    lines_residuals = plt.plot([value, value], [true, predicted], color="red")

-plt.legend(
-    [line_predictions[0], lines_residuals[0]], ["Fitted tree", "Residuals"]
+handles, ax = plot_decision_tree_with_residuals(
+    target_train, target_train_predicted, target_test_predicted
 )
-_ = plt.title("Prediction function together \nwith errors on the training set")
+legend_labels = ["Initial decision tree", "Initial residuals"]
+ax.legend(handles, legend_labels, bbox_to_anchor=(1.05, 0.8), loc="upper left")
+_ = ax.set_title("Decision Tree together \nwith errors on the training set")

 # %% [markdown]
 # ```{tip}
 # In the cell above, we manually edited the legend to get only a single label
 # for all the residual lines.
 # ```
 # Since the tree underfits the data, its accuracy is far from perfect on the
-# training data. We can observe this in the figure by looking at the difference
-# between the predictions and the ground-truth data. We represent these errors,
-# called "Residuals", by unbroken red lines.
+# training data. We can observe this in the figure above by looking at the
+# difference between the predictions and the ground-truth data. We represent
+# these errors, called "residuals", using solid red lines.
 #
-# Indeed, our initial tree was not expressive enough to handle the complexity of
+# Indeed, our initial tree is not expressive enough to handle the complexity of
 # the data, as shown by the residuals. In a gradient-boosting algorithm, the
-# idea is to create a second tree which, given the same data `data`, will try to
-# predict the residuals instead of the vector `target`. We would therefore have
-# a tree that is able to predict the errors made by the initial tree.
+# idea is to create a second tree which, given the same `data`, tries to predict
+# the residuals instead of the vector `target`, i.e. we have a second tree that
+# is able to predict the errors made by the initial tree.
 #
 # Let's train such a tree.

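
Note (illustrative, not part of this commit): the new `plot_decision_tree_with_residuals` helper reads `data_train` and `data_test` from the module scope and only parameterizes the targets, which is fine for this notebook. A variant that also takes the feature values as arguments would make the helper reusable across datasets; a sketch of such a variant follows (the name `plot_tree_with_residuals` is hypothetical, not code from the notebook).

    import matplotlib.pyplot as plt
    import seaborn as sns


    def plot_tree_with_residuals(x_train, y_train, y_train_pred, x_test, y_test_pred):
        """Like the committed helper, but with the data passed explicitly."""
        fig, ax = plt.subplots()
        # plot the data
        sns.scatterplot(x=x_train, y=y_train, color="black", alpha=0.5, ax=ax)
        # plot the predictions
        line_predictions = ax.plot(x_test, y_test_pred, "--")
        # plot the residuals as vertical lines between truth and prediction
        for value, true, predicted in zip(x_train, y_train, y_train_pred):
            lines_residuals = ax.plot([value, value], [true, predicted], color="red")
        handles = [line_predictions[0], lines_residuals[0]]
        return handles, ax


    # Hypothetical usage with the notebook's objects:
    # handles, ax = plot_tree_with_residuals(
    #     data_train["Feature"], target_train, target_train_predicted,
    #     data_test["Feature"], target_test_predicted,
    # )
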
@@ -126,29 +144,22 @@ def generate_data(n_samples=50):
 target_test_predicted_residuals = tree_residuals.predict(data_test)

 # %%
-sns.scatterplot(x=data_train["Feature"], y=residuals, color="black", alpha=0.5)
-line_predictions = plt.plot(
-    data_test["Feature"], target_test_predicted_residuals, "--"
+handles, ax = plot_decision_tree_with_residuals(
+    residuals,
+    target_train_predicted_residuals,
+    target_test_predicted_residuals,
 )
-
-# plot the residuals of the predicted residuals
-for value, true, predicted in zip(
-    data_train["Feature"], residuals, target_train_predicted_residuals
-):
-    lines_residuals = plt.plot([value, value], [true, predicted], color="red")
-
-plt.legend(
-    [line_predictions[0], lines_residuals[0]],
-    ["Fitted tree", "Residuals"],
-    bbox_to_anchor=(1.05, 0.8),
-    loc="upper left",
-)
-_ = plt.title("Prediction of the previous residuals")
+legend_labels = [
+    "Predicted residuals",
+    "Residuals of the\npredicted residuals",
+]
+ax.legend(handles, legend_labels, bbox_to_anchor=(1.05, 0.8), loc="upper left")
+_ = ax.set_title("Prediction of the initial residuals")

 # %% [markdown]
-# We see that this new tree only manages to fit some of the residuals. We will
-# focus on a specific sample from the training set (i.e. we know that the sample
-# will be well predicted using two successive trees). We will use this sample to
+# We see that this new tree only manages to fit some of the residuals. We now
+# focus on a specific sample from the training set (as we know that the sample
+# can be well predicted using two successive trees). We will use this sample to
 # explain how the predictions of both trees are combined. Let's first select
 # this sample in `data_train`.

@@ -159,66 +170,49 @@ def generate_data(n_samples=50):
 target_true_residual = residuals.iloc[-2]

 # %% [markdown]
-# Let's plot the previous information and highlight our sample of interest.
-# Let's start by plotting the original data and the prediction of the first
-# decision tree.
+# Let's plot the original data, the predictions of the initial decision tree and
+# highlight our sample of interest, i.e. this is just a zoom of the plot
+# displaying the initial shallow tree.

 # %%
-# Plot the previous information:
-# * the dataset
-# * the predictions
-# * the residuals
-
-sns.scatterplot(
-    x=data_train["Feature"], y=target_train, color="black", alpha=0.5
+handles, ax = plot_decision_tree_with_residuals(
+    target_train, target_train_predicted, target_test_predicted
 )
-plt.plot(data_test["Feature"], target_test_predicted, "--")
-for value, true, predicted in zip(
-    data_train["Feature"], target_train, target_train_predicted
-):
-    lines_residuals = plt.plot([value, value], [true, predicted], color="red")
-
-# Highlight the sample of interest
-plt.scatter(
+ax.scatter(
     sample, target_true, label="Sample of interest", color="tab:orange", s=200
 )
-plt.xlim([-1, 0])
-plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
-_ = plt.title("Tree predictions")
+ax.set_xlim([-1, 0])
+ax.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
+_ = ax.set_title("Zoom of sample of interest\nin the initial decision tree")

 # %% [markdown]
-# Now, let's plot the residuals information. We will plot the residuals computed
-# from the first decision tree and show the residual predictions.
+# Similarly, we plot a zoom of the plot with the prediction of the initial residuals.

 # %%
-# Plot the previous information:
-# * the residuals committed by the first tree
-# * the residual predictions
-# * the residuals of the residual predictions
-
-sns.scatterplot(x=data_train["Feature"], y=residuals, color="black", alpha=0.5)
-plt.plot(data_test["Feature"], target_test_predicted_residuals, "--")
-for value, true, predicted in zip(
-    data_train["Feature"], residuals, target_train_predicted_residuals
-):
-    lines_residuals = plt.plot([value, value], [true, predicted], color="red")
-
-# Highlight the sample of interest
+handles, ax = plot_decision_tree_with_residuals(
+    residuals,
+    target_train_predicted_residuals,
+    target_test_predicted_residuals,
+)
 plt.scatter(
     sample,
     target_true_residual,
     label="Sample of interest",
     color="tab:orange",
     s=200,
 )
-plt.xlim([-1, 0])
-plt.legend()
-_ = plt.title("Prediction of the residuals")
+legend_labels = [
+    "Predicted residuals",
+    "Residuals of the\npredicted residuals",
+]
+ax.set_xlim([-1, 0])
+ax.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
+_ = ax.set_title("Zoom of sample of interest\nin the initial residuals")

 # %% [markdown]
 # For our sample of interest, our initial tree is making an error (small
 # residual). When fitting the second tree, the residual in this case is
-# perfectly fitted and predicted. We will quantitatively check this prediction
+# perfectly fitted and predicted. We can quantitatively check this prediction
 # using the fitted tree. First, let's check the prediction of the initial tree
 # and compare it with the true value.

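
Note (illustrative, not part of this commit): the quantitative check described in the markdown above lives in cells outside this diff. Conceptually it amounts to summing the two trees' predictions for the sample of interest. A minimal sketch, assuming the notebook's objects (`tree`, `tree_residuals`, `data_train`, `target_train`) are in scope; the `.iloc[[-2]]` selection below simply mirrors `residuals.iloc[-2]` from the hunk above.

    # Illustration only; assumes the notebook's earlier cells have been run.
    sample_of_interest = data_train.iloc[[-2]]  # same row as residuals.iloc[-2]
    true_value = target_train.iloc[-2]

    pred_first_tree = tree.predict(sample_of_interest)[0]
    pred_residual = tree_residuals.predict(sample_of_interest)[0]

    print(f"True value:                  {true_value:.3f}")
    print(f"First tree alone:            {pred_first_tree:.3f}")
    # Adding the predicted residual corrects the first tree's error
    print(f"First tree + residual tree:  {pred_first_tree + pred_residual:.3f}")
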
@@ -265,7 +259,9 @@ def generate_data(n_samples=50):
 # second tree corrects the first tree's error, while the third tree corrects the
 # second tree's error and so on).
 #
-# We will compare the generalization performance of random-forest and gradient
+# ## First comparison of GBDT vs random forests
+#
+# We now compare the generalization performance of random-forest and gradient
 # boosting on the California housing dataset.

 # %%
@@ -322,11 +318,12 @@ def generate_data(n_samples=50):
 print(f"Average score time: {cv_results_rf['score_time'].mean():.3f} seconds")

 # %% [markdown]
-# In term of computation performance, the forest can be parallelized and will
+# In terms of computing performance, the forest can be parallelized and thus
 # benefit from using multiple cores of the CPU. In terms of scoring performance,
 # both algorithms lead to very close results.
 #
-# However, we see that the gradient boosting is a very fast algorithm to predict
-# compared to random forest. This is due to the fact that gradient boosting uses
-# shallow trees. We will go into details in the next notebook about the
-# hyperparameters to consider when optimizing ensemble methods.
+# However, we see that gradient boosting is overall faster than random forest.
+# One of the reasons is that random forests typically rely on deep trees (that
+# overfit individually) whereas boosting models build shallow trees (that
+# underfit individually) which are faster to fit and predict. In the following
+# exercise we will explore in more depth how these two models compare.
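
Note (illustrative, not part of this commit): the cross-validation cells referenced by the last two hunks are not included in this diff. For readers who want to reproduce a similar bagging-vs-boosting comparison, here is a self-contained sketch; the estimator settings (`n_estimators`, `n_jobs`, `cv`) are placeholders, not the notebook's actual configuration.

    from sklearn.datasets import fetch_california_housing
    from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
    from sklearn.model_selection import cross_validate

    data, target = fetch_california_housing(return_X_y=True, as_frame=True)

    models = {
        "Gradient boosting": GradientBoostingRegressor(n_estimators=200),
        "Random forest": RandomForestRegressor(n_estimators=200, n_jobs=2),
    }

    # Compare generalization performance and fit/score times via cross-validation
    for name, model in models.items():
        cv_results = cross_validate(model, data, target, cv=5, n_jobs=2)
        print(
            f"{name}: mean R2 = {cv_results['test_score'].mean():.3f}, "
            f"mean fit time = {cv_results['fit_time'].mean():.1f} s, "
            f"mean score time = {cv_results['score_time'].mean():.3f} s"
        )
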
