# %% [markdown]
# # Gradient-boosting decision tree (GBDT)
#
- # In this notebook, we will present the gradient boosting decision tree
- # algorithm and contrast it with AdaBoost.
+ # In this notebook, we present the gradient boosting decision tree algorithm.
#
- # Gradient-boosting differs from AdaBoost due to the following reason: instead
- # of assigning weights to specific samples, GBDT will fit a decision tree on the
- # residuals error (hence the name "gradient") of the previous tree. Therefore,
- # each new tree in the ensemble predicts the error made by the previous learner
- # instead of predicting the target directly.
+ # Although AdaBoost and GBDT are both boosting algorithms, they differ in
+ # nature: the former assigns weights to specific samples, whereas GBDT fits
+ # successive decision trees on the residual errors (hence the name "gradient")
+ # of their preceding tree. Therefore, each new tree in the ensemble tries to
+ # refine its predictions by specifically addressing the errors made by the
+ # previous learner, instead of predicting the target directly.
#
- # In this section, we will provide some intuition about the way learners are
- # combined to give the final prediction. In this regard, let's go back to our
- # regression problem which is more intuitive for demonstrating the underlying
+ # In this section, we provide some intuition about the way learners are combined
+ # to give the final prediction. For that purpose, we tackle a single-feature
+ # regression problem, which is more intuitive for demonstrating the underlying
# machinery.
+ #
+ # Later in this notebook, we compare the performance of GBDT (boosting) with
+ # that of a random forest (bagging) on a particular dataset.
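As a side illustration of the residual-fitting idea described above, boosting can be written as a short manual loop. The snippet below is only a sketch and is not part of the notebook: the synthetic data, tree depth, learning rate and number of iterations are arbitrary choices.

```python
# Illustrative sketch: manual gradient boosting with squared-error residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.randn(100)

learning_rate, n_iterations = 0.5, 20
prediction = np.zeros_like(y)  # start from a trivial (all-zero) prediction
boosted_trees = []
for _ in range(n_iterations):
    residuals = y - prediction  # errors made by the current ensemble
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # refine the ensemble
    boosted_trees.append(tree)

print(f"Training MSE after boosting: {np.mean((y - prediction) ** 2):.4f}")
```

Each iteration fits a small tree to what the current ensemble still gets wrong, and the shrunken predictions accumulate into the final model.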

# %%
import pandas as pd
import numpy as np

- # Create a random number generator that will be used to set the randomness
- rng = np.random.RandomState(0)
+ rng = np.random.RandomState(0)  # Create a random number generator


def generate_data(n_samples=50):
@@ -60,9 +62,9 @@ def generate_data(n_samples=50):
_ = plt.title("Synthetic regression dataset")

# %% [markdown]
- # As we previously discussed, boosting will be based on assembling a sequence of
- # learners. We will start by creating a decision tree regressor. We will set the
- # depth of the tree so that the resulting learner will underfit the data.
+ # As we previously discussed, boosting is based on assembling a sequence of
+ # learners. We start by creating a decision tree regressor. We set the depth of
+ # the tree to underfit the data on purpose.

# %%
from sklearn.tree import DecisionTreeRegressor
@@ -74,45 +76,61 @@ def generate_data(n_samples=50):
target_test_predicted = tree.predict(data_test)

# %% [markdown]
- # Using the term "test" here refers to data that was not used for training. It
- # should not be confused with data coming from a train-test split, as it was
- # generated in equally-spaced intervals for the visual evaluation of the
- # predictions.
+ # Using the term "test" here refers to data not used for training. It should not
+ # be confused with data coming from a train-test split, as it was generated in
+ # equally-spaced intervals for the visual evaluation of the predictions.
+ #
+ # To avoid writing the same code in multiple places, we define a helper function
+ # to plot the data samples as well as the decision tree predictions and
+ # residuals.
+

# %%
- # plot the data
- sns.scatterplot(
-     x=data_train["Feature"], y=target_train, color="black", alpha=0.5
- )
- # plot the predictions
- line_predictions = plt.plot(data_test["Feature"], target_test_predicted, "--")
+ def plot_decision_tree_with_residuals(y_train, y_train_pred, y_test_pred):
+     # Create a plot and get the Axes object
+     fig, ax = plt.subplots()
+     # plot the data
+     sns.scatterplot(
+         x=data_train["Feature"], y=y_train, color="black", alpha=0.5, ax=ax
+     )
+     # plot the predictions
+     line_predictions = ax.plot(data_test["Feature"], y_test_pred, "--")
+
+     # plot the residuals
+     for value, true, predicted in zip(
+         data_train["Feature"], y_train, y_train_pred
+     ):
+         lines_residuals = ax.plot(
+             [value, value], [true, predicted], color="red"
+         )
+
+     handles = [line_predictions[0], lines_residuals[0]]
+
+     return handles, ax

- # plot the residuals
- for value, true, predicted in zip(
-     data_train["Feature"], target_train, target_train_predicted
- ):
-     lines_residuals = plt.plot([value, value], [true, predicted], color="red")

- plt.legend(
-     [line_predictions[0], lines_residuals[0]], ["Fitted tree", "Residuals"]
+ handles, ax = plot_decision_tree_with_residuals(
+     target_train, target_train_predicted, target_test_predicted
)
- _ = plt.title("Prediction function together \nwith errors on the training set")
+ legend_labels = ["Initial decision tree", "Initial residuals"]
+ ax.legend(handles, legend_labels, bbox_to_anchor=(1.05, 0.8), loc="upper left")
+ _ = ax.set_title("Decision Tree together \nwith errors on the training set")

# %% [markdown]
# ```{tip}
# In the cell above, we manually edited the legend to get only a single label
# for all the residual lines.
# ```
# Since the tree underfits the data, its accuracy is far from perfect on the
- # training data. We can observe this in the figure by looking at the difference
- # between the predictions and the ground-truth data. We represent these errors,
- # called "Residuals", by unbroken red lines.
+ # training data. We can observe this in the figure above by looking at the
+ # difference between the predictions and the ground-truth data. We represent
+ # these errors, called "residuals", using solid red lines.
#
- # Indeed, our initial tree was not expressive enough to handle the complexity of
+ # Indeed, our initial tree is not expressive enough to handle the complexity of
# the data, as shown by the residuals. In a gradient-boosting algorithm, the
- # idea is to create a second tree which, given the same data `data`, will try to
- # predict the residuals instead of the vector `target`. We would therefore have
- # a tree that is able to predict the errors made by the initial tree.
+ # idea is to create a second tree which, given the same `data`, tries to predict
+ # the residuals instead of the vector `target`, i.e. we have a second tree that
+ # is able to predict the errors made by the initial tree.
#
# Let's train such a tree.
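The cell that actually trains this second tree is elided from this diff. A minimal sketch of that step, assuming it reuses the variables defined earlier in the notebook and an arbitrary `max_depth` (not a value taken from the notebook), could look like:

```python
# Sketch only: fit a second tree on the residuals of the first one
# (max_depth is an arbitrary choice for illustration; the notebook's
# actual setting is not shown in this diff).
from sklearn.tree import DecisionTreeRegressor

residuals = target_train - target_train_predicted
tree_residuals = DecisionTreeRegressor(max_depth=5)
tree_residuals.fit(data_train, residuals)
target_train_predicted_residuals = tree_residuals.predict(data_train)
```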

@@ -126,29 +144,22 @@ def generate_data(n_samples=50):
target_test_predicted_residuals = tree_residuals.predict(data_test)

# %%
- sns.scatterplot(x=data_train["Feature"], y=residuals, color="black", alpha=0.5)
- line_predictions = plt.plot(
-     data_test["Feature"], target_test_predicted_residuals, "--"
+ handles, ax = plot_decision_tree_with_residuals(
+     residuals,
+     target_train_predicted_residuals,
+     target_test_predicted_residuals,
)
-
- # plot the residuals of the predicted residuals
- for value, true, predicted in zip(
-     data_train["Feature"], residuals, target_train_predicted_residuals
- ):
-     lines_residuals = plt.plot([value, value], [true, predicted], color="red")
-
- plt.legend(
-     [line_predictions[0], lines_residuals[0]],
-     ["Fitted tree", "Residuals"],
-     bbox_to_anchor=(1.05, 0.8),
-     loc="upper left",
- )
- _ = plt.title("Prediction of the previous residuals")
+ legend_labels = [
+     "Predicted residuals",
+     "Residuals of the\npredicted residuals",
+ ]
+ ax.legend(handles, legend_labels, bbox_to_anchor=(1.05, 0.8), loc="upper left")
+ _ = ax.set_title("Prediction of the initial residuals")

# %% [markdown]
- # We see that this new tree only manages to fit some of the residuals. We will
- # focus on a specific sample from the training set (i.e. we know that the sample
- # will be well predicted using two successive trees). We will use this sample to
+ # We see that this new tree only manages to fit some of the residuals. We now
+ # focus on a specific sample from the training set (as we know that the sample
+ # can be well predicted using two successive trees). We will use this sample to
# explain how the predictions of both trees are combined. Let's first select
# this sample in `data_train`.

@@ -159,66 +170,49 @@ def generate_data(n_samples=50):
target_true_residual = residuals.iloc[-2]

# %% [markdown]
- # Let's plot the previous information and highlight our sample of interest.
- # Let's start by plotting the original data and the prediction of the first
- # decision tree.
+ # Let's plot the original data, the predictions of the initial decision tree,
+ # and highlight our sample of interest: this is just a zoomed-in view of the
+ # plot displaying the initial shallow tree.

# %%
- # Plot the previous information:
- # * the dataset
- # * the predictions
- # * the residuals
-
- sns.scatterplot(
-     x=data_train["Feature"], y=target_train, color="black", alpha=0.5
+ handles, ax = plot_decision_tree_with_residuals(
+     target_train, target_train_predicted, target_test_predicted
)
- plt.plot(data_test["Feature"], target_test_predicted, "--")
- for value, true, predicted in zip(
-     data_train["Feature"], target_train, target_train_predicted
- ):
-     lines_residuals = plt.plot([value, value], [true, predicted], color="red")
-
- # Highlight the sample of interest
- plt.scatter(
+ ax.scatter(
    sample, target_true, label="Sample of interest", color="tab:orange", s=200
)
- plt.xlim([-1, 0])
- plt.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
- _ = plt.title("Tree predictions")
+ ax.set_xlim([-1, 0])
+ ax.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
+ _ = ax.set_title("Zoom of sample of interest\nin the initial decision tree")

# %% [markdown]
- # Now, let's plot the residuals information. We will plot the residuals computed
- # from the first decision tree and show the residual predictions.
+ # Similarly, we plot a zoomed-in view of the prediction of the initial
+ # residuals.

# %%
- # Plot the previous information:
- # * the residuals committed by the first tree
- # * the residual predictions
- # * the residuals of the residual predictions
-
- sns.scatterplot(x=data_train["Feature"], y=residuals, color="black", alpha=0.5)
- plt.plot(data_test["Feature"], target_test_predicted_residuals, "--")
- for value, true, predicted in zip(
-     data_train["Feature"], residuals, target_train_predicted_residuals
- ):
-     lines_residuals = plt.plot([value, value], [true, predicted], color="red")
-
- # Highlight the sample of interest
+ handles, ax = plot_decision_tree_with_residuals(
+     residuals,
+     target_train_predicted_residuals,
+     target_test_predicted_residuals,
+ )
plt.scatter(
    sample,
    target_true_residual,
    label="Sample of interest",
    color="tab:orange",
    s=200,
)
- plt.xlim([-1, 0])
- plt.legend()
- _ = plt.title("Prediction of the residuals")
+ legend_labels = [
+     "Predicted residuals",
+     "Residuals of the\npredicted residuals",
+ ]
+ ax.set_xlim([-1, 0])
+ ax.legend(bbox_to_anchor=(1.05, 0.8), loc="upper left")
+ _ = ax.set_title("Zoom of sample of interest\nin the initial residuals")

# %% [markdown]
# For our sample of interest, our initial tree is making an error (small
# residual). When fitting the second tree, the residual in this case is
- # perfectly fitted and predicted. We will quantitatively check this prediction
+ # perfectly fitted and predicted. We can quantitatively check this prediction
# using the fitted tree. First, let's check the prediction of the initial tree
# and compare it with the true value.
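The cells performing this check are elided from this diff. Assuming the variables defined earlier in the notebook (`tree`, `tree_residuals`, `sample`, `target_true`), a sketch of such a check could be:

```python
# Sketch only (the actual cells are not shown in this diff): the boosted
# prediction for the sample of interest is the sum of both trees' outputs.
y_pred_first = tree.predict(sample)               # prediction of the initial tree
y_pred_residual = tree_residuals.predict(sample)  # predicted residual
y_pred_combined = y_pred_first + y_pred_residual  # corrected (boosted) prediction

print(f"True value for the sample of interest: {target_true}")
print(f"Initial tree prediction:               {y_pred_first[0]:.3f}")
print(f"Prediction combining both trees:       {y_pred_combined[0]:.3f}")
```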

@@ -265,7 +259,9 @@ def generate_data(n_samples=50):
# second tree corrects the first tree's error, while the third tree corrects the
# second tree's error and so on).
#
- # We will compare the generalization performance of random-forest and gradient
+ # ## First comparison of GBDT vs random forests
+ #
+ # We now compare the generalization performance of a random forest and gradient
# boosting on the California housing dataset.
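The cells that run this comparison are largely elided from this diff. A sketch of such a benchmark, with estimators and settings chosen here purely for illustration rather than taken from the notebook, could be:

```python
# Illustrative sketch: cross-validate both ensembles on California housing and
# record scores and fit times (estimator choices are assumptions, not the
# notebook's actual configuration).
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_validate

data, target = fetch_california_housing(return_X_y=True, as_frame=True)

cv_results_gbdt = cross_validate(
    GradientBoostingRegressor(), data, target, n_jobs=2
)
cv_results_rf = cross_validate(
    RandomForestRegressor(n_jobs=2), data, target, n_jobs=2
)

for name, cv_results in [("GBDT", cv_results_gbdt), ("Random forest", cv_results_rf)]:
    print(
        f"{name}: mean R2 = {cv_results['test_score'].mean():.3f}, "
        f"mean fit time = {cv_results['fit_time'].mean():.3f} s"
    )
```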

# %%
@@ -322,11 +318,12 @@ def generate_data(n_samples=50):
print(f"Average score time: {cv_results_rf['score_time'].mean():.3f} seconds")

# %% [markdown]
- # In term of computation performance, the forest can be parallelized and will
+ # In terms of computing performance, the forest can be parallelized and thus
# benefit from using multiple cores of the CPU. In terms of scoring performance,
# both algorithms lead to very close results.
#
- # However, we see that the gradient boosting is a very fast algorithm to predict
- # compared to random forest. This is due to the fact that gradient boosting uses
- # shallow trees. We will go into details in the next notebook about the
- # hyperparameters to consider when optimizing ensemble methods.
+ # However, we see that gradient boosting is overall faster than random forest.
+ # One of the reasons is that random forests typically rely on deep trees (that
+ # overfit individually) whereas boosting models build shallow trees (that
+ # underfit individually), which are faster to fit and to predict with. In the
+ # following exercise, we explore in more depth how these two models compare.
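To make the deep-versus-shallow contrast concrete, one can inspect the depth of the trees grown by each ensemble on a toy dataset. This is only an illustrative sketch; the dataset and estimator settings below are arbitrary choices, not values from the notebook:

```python
# Illustrative sketch: compare tree depths in a random forest vs a GBDT model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=1_000, n_features=5, noise=10, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
boosting = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

forest_depths = [tree.get_depth() for tree in forest.estimators_]
boosting_depths = [tree.get_depth() for tree in boosting.estimators_.ravel()]

print(f"Mean tree depth in the random forest: {np.mean(forest_depths):.1f}")
print(f"Mean tree depth in gradient boosting: {np.mean(boosting_depths):.1f}")
```

The forest grows its trees until the leaves are (nearly) pure, whereas gradient boosting keeps them shallow (depth 3 by default here), which is one reason its fit and predict steps are cheaper.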