[ci skip] ENH Mention scaling behavior of binning and splines (#739)
Co-authored-by: ArturoAmorQ <[email protected]>
Co-authored-by: Olivier Grisel <[email protected]> 767499b
ogrisel and ArturoAmorQ committed Oct 26, 2023
1 parent 442080b commit e0570d1
Showing 10 changed files with 35 additions and 18 deletions.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@@ -235,7 +235,10 @@ def plot_decision_boundary(model, title=None):
# %%
from sklearn.preprocessing import KBinsDiscretizer

-classifier = make_pipeline(KBinsDiscretizer(n_bins=5), LogisticRegression())
+classifier = make_pipeline(
+    KBinsDiscretizer(n_bins=5, encode="onehot"),  # already the default params
+    LogisticRegression(),
+)
classifier

# %%
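
The inline comment in this hunk says that `encode="onehot"` is already the default, so the change should be purely cosmetic. A minimal sketch for checking that claim locally is given here; the toy data and the names `X_toy`, `default_binner`, `explicit_binner`, `same_params` and `same_output` are assumptions for illustration and are not part of the commit.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical toy data, not taken from the notebook.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))

# One discretizer relies on the default encoding, the other spells it out.
default_binner = KBinsDiscretizer(n_bins=5)
explicit_binner = KBinsDiscretizer(n_bins=5, encode="onehot")

# Compare the full parameter dictionaries and the encoded outputs.
same_params = default_binner.get_params() == explicit_binner.get_params()
same_output = np.array_equal(
    default_binner.fit_transform(X_toy).toarray(),
    explicit_binner.fit_transform(X_toy).toarray(),
)
print(same_params, same_output)
```

If both flags are true, the rewritten pipeline behaves like the previous one-liner and the hunk only makes the encoding explicit for readers.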
@@ -279,15 +282,20 @@ def plot_decision_boundary(model, title=None):
# We can see that the decision boundary is now smooth, and while it favors
# axis-aligned decision rules when extrapolating in low density regions, it can
# adopt a more curvy decision boundary in the high density regions.
-#
-# Note however, that the number of knots is a hyperparameter that needs to be
-# tuned. If we use too few knots, the model would underfit the data, as shown on
-# the moons dataset. If we use too many knots, the model would overfit the data.
-#
# However, as for the binning transformation, the model still fails to separate
# the data for the XOR dataset, irrespective of the number of knots, for the
# same reasons: **the spline transformation is a feature-wise transformation**
# and thus **cannot capture interactions** between features.
+#
+# Take into account that the number of knots is a hyperparameter that needs to be
+# tuned. If we use too few knots, the model would underfit the data, as shown on
+# the moons dataset. If we use too many knots, the model would overfit the data.
+#
+# ```{note}
+# Notice that `KBinsDiscretizer(encode="onehot")` and `SplineTransformer` do not
+# require additional scaling. Indeed, they can replace the scaling step for
+# numerical features: they both create features with values in the [0, 1] range.
+# ```

# %% [markdown]
#
4 changes: 2 additions & 2 deletions appendix/notebook_timings.html
@@ -1004,9 +1004,9 @@ <h1>Notebook timings<a class="headerlink" href="#notebook-timings" title="Permal
<td><p></p></td>
</tr>
<tr class="row-odd"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_models_feature_engineering_classification.html"><span class="doc">python_scripts/linear_models_feature_engineering_classification</span></a></p></td>
-<td><p>2023-10-20 14:15</p></td>
+<td><p>2023-10-26 11:59</p></td>
<td><p>cache</p></td>
-<td><p>10.8</p></td>
+<td><p>10.37</p></td>
<td><p></p></td>
</tr>
<tr class="row-even"><td><p><a class="xref doc reference internal" href="../python_scripts/linear_models_regularization.html"><span class="doc">python_scripts/linear_models_regularization</span></a></p></td>
@@ -935,7 +935,10 @@ <h2>Engineering non-linear features<a class="headerlink" href="#engineering-non-
<div class="cell_input docutils container">
<div class="highlight-ipython3 notranslate"><div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">sklearn.preprocessing</span> <span class="kn">import</span> <span class="n">KBinsDiscretizer</span>

-<span class="n">classifier</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span><span class="n">KBinsDiscretizer</span><span class="p">(</span><span class="n">n_bins</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> <span class="n">LogisticRegression</span><span class="p">())</span>
+<span class="n">classifier</span> <span class="o">=</span> <span class="n">make_pipeline</span><span class="p">(</span>
+    <span class="n">KBinsDiscretizer</span><span class="p">(</span><span class="n">n_bins</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">encode</span><span class="o">=</span><span class="s2">&quot;onehot&quot;</span><span class="p">),</span>  <span class="c1"># already the default params</span>
+    <span class="n">LogisticRegression</span><span class="p">(),</span>
+<span class="p">)</span>
<span class="n">classifier</span>
</pre></div>
</div>
@@ -999,14 +1002,20 @@ <h2>Engineering non-linear features<a class="headerlink" href="#engineering-non-
</div>
<p>We can see that the decision boundary is now smooth, and while it favors
axis-aligned decision rules when extrapolating in low density regions, it can
-adopt a more curvy decision boundary in the high density regions.</p>
-<p>Note however, that the number of knots is a hyperparameter that needs to be
-tuned. If we use too few knots, the model would underfit the data, as shown on
-the moons dataset. If we use too many knots, the model would overfit the data.</p>
-<p>However, as for the binning transformation, the model still fails to separate
+adopt a more curvy decision boundary in the high density regions.
+However, as for the binning transformation, the model still fails to separate
the data for the XOR dataset, irrespective of the number of knots, for the
same reasons: <strong>the spline transformation is a feature-wise transformation</strong>
and thus <strong>cannot capture interactions</strong> between features.</p>
+<p>Take into account that the number of knots is a hyperparameter that needs to be
+tuned. If we use too few knots, the model would underfit the data, as shown on
+the moons dataset. If we use too many knots, the model would overfit the data.</p>
+<div class="admonition note">
+<p class="admonition-title">Note</p>
+<p>Notice that <code class="docutils literal notranslate"><span class="pre">KBinsDiscretizer(encode=&quot;onehot&quot;)</span></code> and <code class="docutils literal notranslate"><span class="pre">SplineTransformer</span></code> do not
+require additional scaling. Indeed, they can replace the scaling step for
+numerical features: they both create features with values in the [0, 1] range.</p>
+</div>
</section>
<section id="modeling-non-additive-feature-interactions">
<h2>Modeling non-additive feature interactions<a class="headerlink" href="#modeling-non-additive-feature-interactions" title="Permalink to this heading">#</a></h2>
@@ -1084,7 +1093,7 @@ <h2>Modeling non-additive feature interactions<a class="headerlink" href="#model
</div>
</div>
<div class="cell_output docutils container">
-<img alt="../_images/d98cf10afde42d00cba794b3555d7e9ba000cebdef829967a70a33ccba1b60db.png" src="../_images/d98cf10afde42d00cba794b3555d7e9ba000cebdef829967a70a33ccba1b60db.png" />
+<img alt="../_images/96da411b5c4ebfaa3ebeb9c05c1fa91e8164f132b58558585e11ad4b7d55a671.png" src="../_images/96da411b5c4ebfaa3ebeb9c05c1fa91e8164f132b58558585e11ad4b7d55a671.png" />
</div>
</div>
<p>The polynomial kernel approach would be interesting in cases where the
@@ -1120,7 +1129,7 @@ <h2>Modeling non-additive feature interactions<a class="headerlink" href="#model
</div>
</div>
<div class="cell_output docutils container">
-<img alt="../_images/a830e52976bbc92558a4865dcfe3ea4ee88afda4db3236c609b4f65dca9f6558.png" src="../_images/a830e52976bbc92558a4865dcfe3ea4ee88afda4db3236c609b4f65dca9f6558.png" />
+<img alt="../_images/fb409cbf68b13df4149fba8f25d820751b1e5ad85c2612c3fe73045a40c0c004.png" src="../_images/fb409cbf68b13df4149fba8f25d820751b1e5ad85c2612c3fe73045a40c0c004.png" />
</div>
</div>
<p>The resulting decision boundary is <strong>smooth</strong> and can successfully separate
@@ -1197,7 +1206,7 @@ <h2>Multi-step feature engineering<a class="headerlink" href="#multi-step-featur
</div>
</div>
<div class="cell_output docutils container">
-<img alt="../_images/65904496ae44a4a185e7c69818a8751bd821541b100b26091cf76db157f5a3f6.png" src="../_images/65904496ae44a4a185e7c69818a8751bd821541b100b26091cf76db157f5a3f6.png" />
+<img alt="../_images/aa30de15e213b5786f4300f81791f9ae43dbe0b3edb40fcea1c7d5ec46154031.png" src="../_images/aa30de15e213b5786f4300f81791f9ae43dbe0b3edb40fcea1c7d5ec46154031.png" />
</div>
</div>
<p>The decision boundary of this pipeline is smooth, but with axis-aligned
2 changes: 1 addition & 1 deletion searchindex.js

Large diffs are not rendered by default.
