Clarify encoding pipeline #820

Open · wants to merge 3 commits into base: main
book/chapters/chapter9/preprocessing.qmd (6 changes: 5 additions & 1 deletion)
@@ -20,7 +20,7 @@ In this book, preprocessing refers to everything that happens with *data* before
Another aspect of preprocessing is `r index('feature engineering', aside = TRUE)`, which covers all other transformations of data before it is fed to the machine learning model, including the creation of features from possibly unstructured data, such as written text, sequences or images.
The goal of feature engineering is to enable the data to be handled by a given learner, and/or to further improve predictive performance.
It is important to note that feature engineering helps mostly for simpler algorithms, while highly complex models usually gain less from it and require little data preparation to be trained.
-Common difficulties in data that can be solved with feature engineering include features with skewed distributions, high cardinality categorical features, missing observations, high dimensionality and imbalanced classes in classification tasks.
+Common difficulties in data that can be solved with feature engineering include features with skewed distributions, high-cardinality categorical features, missing observations, high dimensionality and imbalanced classes in classification tasks.
Deep learning has shown promising results in automating feature engineering; however, its effectiveness depends on the complexity and nature of the data being processed, as well as the specific problem being addressed.
Typically it can work well with natural language processing and computer vision problems, while for standard tabular data, tree-based ensembles such as a random forest or gradient boosting are often still superior (and easier to handle). However, tabular deep learning approaches are currently catching up quickly.
Hence, manual feature engineering is often still required, but `mlr3pipelines` can simplify the process as much as possible.
@@ -151,6 +151,10 @@ factor_pipeline =
affect_columns = selector_type("factor"), id = "binary_enc")
```

The order in which operations are performed matters here: `po("encodeimpact")` converts high-cardinality `factor` type features into `numeric` features, so these will not be affected by the `po("encode")` operators that come afterwards.
Therefore, the one-hot encoding PipeOp does not need to be configured to explicitly exclude high-cardinality features.
Likewise, once the treatment encoding PipeOp sees the data, all non-binary `factor` features have been converted, so it will only affect binary factors by default.
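
To make the ordering concrete, here is a minimal sketch of such a three-step encoding graph. The cardinality threshold of 10 levels, the `selector_cardinality_greater_than()` selectors and the `high_card_enc`/`low_card_enc` ids are illustrative assumptions; only the final treatment-encoding step mirrors the fragment shown above.

```r
library(mlr3pipelines)

factor_pipeline =
  # 1. impact-encode factors with many levels; these columns become numeric
  #    and are therefore ignored by the encoders further down the graph
  po("encodeimpact",
    affect_columns = selector_cardinality_greater_than(10), id = "high_card_enc") %>>%
  # 2. one-hot encode the remaining factors with more than two levels
  po("encode", method = "one-hot",
    affect_columns = selector_cardinality_greater_than(2), id = "low_card_enc") %>>%
  # 3. treatment-encode the factors that are left, i.e. the binary ones
  po("encode", method = "treatment",
    affect_columns = selector_type("factor"), id = "binary_enc")
```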

Now we can apply this pipeline to our xgboost model to use it in a benchmark experiment; we also compare a simpler pipeline that only uses one-hot encoding to demonstrate performance differences resulting from different strategies.
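
As a rough, self-contained sketch of such a comparison (the `german_credit` task, the 3-fold cross-validation and the `classif.xgboost` learner key are placeholders, not necessarily the book's setup):

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# wrap the full encoding pipeline and a plain one-hot pipeline around xgboost
glrn_full = as_learner(factor_pipeline %>>% lrn("classif.xgboost"))
glrn_full$id = "xgboost_full_encoding"

glrn_onehot = as_learner(po("encode", method = "one-hot") %>>% lrn("classif.xgboost"))
glrn_onehot$id = "xgboost_one_hot"

# benchmark both pipelines on a task with several factor features
design = benchmark_grid(
  tasks = tsk("german_credit"),
  learners = list(glrn_full, glrn_onehot),
  resamplings = rsmp("cv", folds = 3)
)
bmr = benchmark(design)
bmr$aggregate(msr("classif.ce"))
```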
