7 | 7 |
8 | 8 | \newcommand{\titlefigure}{figure/compboost-illustration-2.png}
9 | 9 | \newcommand{\learninggoals}{
10 | | - \item Learn the concept of componentwise boosting and its relation to GLM
| 10 | + \item Learn the concept of componentwise boosting (CWB)
11 | 11 | \item Understand the built-in feature selection process
12 | 12 | \item Understand the problem of fair base learner selection
13 | 13 | }
29 | 29 |
30 | 30 | \lz
31 | 31 |
32 | | -The aim of componentwise gradient boosting is to find a model that:
| 32 | +The aim of componentwise gradient boosting (CWB) is to find a model that:
33 | 33 |
34 | 34 | \begin{itemize}
35 | 35 | \item
53 | 53 |
54 | 54 | \lz
55 | 55 |
56 | | -Because of this, componentwise gradient boosting is also often referred to as \textbf{model-based boosting}.
| 56 | +Because of this, CWB is also often referred to as \textbf{model-based boosting}.
57 | 57 |
58 | 58 | \end{vbframe}
59 | 59 |
66 | 66 |
67 | 67 | \lz
68 | 68 |
69 | | -For componentwise gradient boosting we generalize this to multiple base learner sets $\{ \mathcal{B}_1, ... \mathcal{B}_J \}$ with associated parameter spaces
| 69 | +For CWB we generalize this to multiple base learner sets $\{ \mathcal{B}_1, ... \mathcal{B}_J \}$ with associated parameter spaces
70 | 70 | $\{ \bm{\Theta}_1, ... \bm{\Theta}_J \}$,
71 | 71 | % $$
72 | 72 | % % b_j^{[m]}(\xv,\pmb\theta^{[m]}) \quad j = 1,\dots, J\,,
155 | 155 | \end{vbframe}
156 | 156 |
157 | 157 |
158 | | -
159 | 158 | % ------------------------------------------------------------------------------
160 | 159 |
161 | | -\begin{vbframe}{relation to glm}
162 | | -
163 | | -In the simplest case we use linear models (without intercept) on single features
164 | | -as base learners:
165 | | -
166 | | -$$
167 | | -  b_j(x_j,\theta) = \theta x_j \quad \text{for } j = 1, 2, \dots, p \quad
168 | | -  \text{and with } b_j \in \mathcal{B}_j = \{\theta x_j ~\rvert~ \theta \in
169 | | -  \mathbb{R} \}.
170 | | -$$
171 | | -
172 | | -
173 | | -This definition will result in an ordinary \textbf{linear regression} model.
174 | | -
175 | | -% .\footnote{Note: a linear model base learner without intercept only makes sense if the covariates are centered (see \texttt{mboost} tutorial, page7)}
176 | | -
| 160 | +\begin{vbframe}{intercept handling}
177 | 161 |
178 | 162 | \begin{itemize}
179 | | - \item Note that linear base learners without intercept only make sense for
180 | | - covariates that have been centered before.
181 | | - \item If we let the boosting algorithm converge, i.e., let it run for a really
182 | | - long time, the parameters will converge to the \textbf{same solution} as the
183 | | - ML estimate.
184 | | - \item This means that, by specifying a loss function according to the negative
185 | | - likelihood of a distribution from an exponential family and defining a link
186 | | - function accordingly, this kind of boosting is equivalent to a (regularized)
187 | | - \textbf{generalized linear model (GLM)}.
| 163 | + \item CWB is initialized with a loss-optimal constant $\fm[0](\xv)$ as the initial model intercept.
| 164 | + \item The intercept is often described as the part of the model that contains information independent of the features.
| 165 | + \item Suppose linear base learners $b_j(\xv) = \theta_{j1} + \theta_{j2} x_j$ with intercept $\theta_{j1}$ and slope $\theta_{j2}$.
| 166 | + \item Adding base learner $\hat{b}_j$ in iteration $m$ with parameter estimates $\thetamh = (\hat{\theta}_{j1}^{[m]}, \hat{\theta}_{j2}^{[m]})$ consequently updates the intercept to $\fm[0](\xv) + \hat{\theta}_{j1}^{[m]}$.
| 167 | + \item Throughout the fitting process, the intercept is therefore adjusted $M$ times, resulting in the final intercept (see the worked example below):
| 168 | + $$
| 169 | + \fm[0](\xv) + \sum\limits_{m=1}^M \hat{\theta}^{[m]}_{j^{[m]}1}
| 170 | + $$
188 | 171 | \end{itemize}
189 | 172 |
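For concreteness, a small worked example of this accumulation (the loss, the selected features, and all numbers below are made up purely for illustration, and a possible learning rate is ignored): with squared-error loss, the loss-optimal constant is the mean of the target, $\fm[0](\xv) = \bar{y}$. Say $\bar{y} = 3$, iteration $m = 1$ selects the base learner of $x_2$ with intercept estimate $\hat{\theta}^{[1]}_{21} = 0.4$, and iteration $m = 2$ selects the base learner of $x_1$ with $\hat{\theta}^{[2]}_{11} = -0.1$. After $M = 2$ iterations the aggregated intercept is
$$
\fm[0](\xv) + \sum\limits_{m=1}^{2} \hat{\theta}^{[m]}_{j^{[m]}1} = 3 + 0.4 - 0.1 = 3.3,
$$
while the slope estimates $\hat{\theta}^{[m]}_{j^{[m]}2}$ accumulate in the corresponding feature effects.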
190 | | -\framebreak
191 | | -
192 | 173 | % ------------------------------------------------------------------------------
193 | 174 |
194 | | -But: We do not \emph{need} an exponential family and thus are able to fit models
195 | | -to all kinds of other distributions and losses, as long as we can calculate (or
196 | | -approximate) a derivative of the loss.
197 | | -% Note, however, that this does not imply that the algorithm does something
198 | | -% meaningful (e.g., non-convex loss functions would still require some
199 | | -% additional effort).
200 | | -
201 | | -\lz
| 175 | +\framebreak
202 | 176 |
203 | | -Usually we do not let the boosting model converge fully, but \textbf{stop
204 | | -early} for the sake of regularization and feature selection.
| 177 | +Two options for handling the intercept in CWB are (option 1 is sketched below the list):
205 | 178 |
206 | | -\lz
| 179 | +\begin{itemize}
207 | 180 |
208 | | -Even though the resulting model looks like a GLM, we do not have valid standard
209 | | -errors for our coefficients,
210 | | -so cannot provide confidence or prediction intervals or perform tests etc.
211 | | -$\rightarrow$ post-selection inference.
| 181 | +\item Include an intercept base learner:
| 182 | +  \begin{itemize}
| 183 | +    \item Add an intercept base learner $b_{\text{int}} = \theta$ as a candidate that is considered in each iteration.
| 184 | +    \item At the same time, remove the intercept from all linear base learners, i.e., use only $b_j(\xv) = \theta_j x_j$.
| 185 | +    \item The final intercept is given by $\fm[0](\xv) + \hat{\theta}$.
| 186 | +  \end{itemize}
| 187 | +\item Include an intercept in each linear base learner, $b_j(\xv) = \theta_{j1} + \theta_{j2} x_j$, and accumulate all estimated intercepts into one global intercept after fitting.
212 | 188 |
213 | | -\end{vbframe}
| 189 | +\end{itemize}
214 | 190 |
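As a rough sketch of how option 1 (the intercept base learner) plugs into the usual CWB iteration — the notation for pseudo-residuals $\tilde{r}^{[m](i)}$ and observations $\xv^{(i)}$ is chosen ad hoc here and may differ from the rest of the deck: in iteration $m$, every candidate base learner, now including $b_{\text{int}}$, is fit to the pseudo-residuals by least squares, and the best-fitting one is selected,
$$
j^{[m]} = \arg\min_{j \in \{\text{int}, 1, \dots, J\}} \sum\limits_{i=1}^{n} \left( \tilde{r}^{[m](i)} - \hat{b}_j\big(\xv^{(i)}\big) \right)^2.
$$
Whenever $b_{\text{int}}$ wins, only the global intercept is updated; otherwise one feature effect (which by construction carries no intercept) is updated.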
215 | 191 | % ------------------------------------------------------------------------------
216 | 192 |
217 | | -\begin{vbframe}{intercept handling}
| 193 | +\framebreak
| 194 | +
| 195 | +The following figure compares the parameter updates under the two intercept-handling options:
| 196 | +\begin{center}
| 197 | +\includegraphics[width = \textwidth]{figure/compboost-intercept-handling.png}
| 198 | +\end{center}
| 199 | +
| 200 | +The data set used is \href{https://github.com/topepo/AmesHousing}{Ames Housing}.
218 | 201 |
219 | | -\textcolor{red}{@Janek}
220 | 202 |
221 | 203 | \end{vbframe}
222 | 204 |
301 | 283 | \end{vbframe}
302 | 284 |
303 | 285 |
304 | | -
305 | | -\begin{vbframe}{Relation to GLM - continued}
306 | | -
307 | | -The following figure shows the parameter values after $m \in \{250, 500, 1000, 5000, 10000\}$ iterations as well as the estimates from a linear model as crosses (GLM with normally distributed errors):
308 | | -
309 | | -\begin{center}
310 | | -\includegraphics[width=\textwidth]{figure/compboost-to-glm-iter250.png}
311 | | -\end{center}
312 | | -
313 | | -\end{vbframe}
314 | | -
315 | | -\begin{vbframe}{Relation to GLM - continued}
316 | | -
317 | | -The following figure shows the parameter values after $m \in \{250, 500, 1000, 5000, 10000\}$ iterations as well as the estimates from a linear model as crosses (GLM with normally distributed errors):
318 | | -
319 | | -\begin{center}
320 | | -\includegraphics[width=\textwidth]{figure/compboost-to-glm-iter500.png}
321 | | -\end{center}
322 | | -
323 | | -\end{vbframe}
324 | | -
325 | | -\begin{vbframe}{Relation to GLM - continued}
326 | | -
327 | | -The following figure shows the parameter values after $m \in \{250, 500, 1000, 5000, 10000\}$ iterations as well as the estimates from a linear model as crosses (GLM with normally distributed errors):
328 | | -
329 | | -\begin{center}
330 | | -\includegraphics[width=\textwidth]{figure/compboost-to-glm-iter1000.png}
331 | | -\end{center}
332 | | -
333 | | -\end{vbframe}
334 | | -
335 | | -\begin{vbframe}{Relation to GLM - continued}
336 | | -
337 | | -The following figure shows the parameter values after $m \in \{250, 500, 1000, 5000, 10000\}$ iterations as well as the estimates from a linear model as crosses (GLM with normally distributed errors):
338 | | -
339 | | -\begin{center}
340 | | -\includegraphics[width=\textwidth]{figure/compboost-to-glm-iter5000.png}
341 | | -\end{center}
342 | | -
343 | | -\end{vbframe}
344 | | -
345 | | -\begin{vbframe}{Relation to GLM - continued}
346 | | -
347 | | -The following figure shows the parameter values after $m \in \{250, 500, 1000, 5000, 10000\}$ iterations as well as the estimates from a linear model as crosses (GLM with normally distributed errors):
348 | | -
349 | | -\begin{center}
350 | | -\includegraphics[width=\textwidth]{figure/compboost-to-glm-iter10000.png}
351 | | -\end{center}
352 | | -
353 | | -\end{vbframe}
354 | | -
355 | | -
356 | 286 | \endlecture
357 | 287 | \end{document}