From a9d2bccfa9e68638bc4c335786a0a16d4845cce2 Mon Sep 17 00:00:00 2001
From: mb706
Date: Thu, 30 May 2024 17:04:39 +0200
Subject: [PATCH 1/2] Update preprocessing.qmd

---
 book/chapters/chapter9/preprocessing.qmd | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/book/chapters/chapter9/preprocessing.qmd b/book/chapters/chapter9/preprocessing.qmd
index fd7905ebf..00a00bd4a 100644
--- a/book/chapters/chapter9/preprocessing.qmd
+++ b/book/chapters/chapter9/preprocessing.qmd
@@ -20,7 +20,7 @@ In this book, preprocessing refers to everything that happens with *data* before
 Another aspect of preprocessing is `r index('feature engineering', aside = TRUE)`, which covers all other transformations of data before it is fed to the machine learning model, including the creation of features from possibly unstructured data, such as written text, sequences or images.
 The goal of feature engineering is to enable the data to be handled by a given learner, and/or to further improve predictive performance.
 It is important to note that feature engineering helps mostly for simpler algorithms, while highly complex models usually gain less from it and require little data preparation to be trained.
-Common difficulties in data that can be solved with feature engineering include features with skewed distributions, high cardinality categorical features, missing observations, high dimensionality and imbalanced classes in classification tasks.
+Common difficulties in data that can be solved with feature engineering include features with skewed distributions, high-cardinality categorical features, missing observations, high dimensionality and imbalanced classes in classification tasks.
 Deep learning has shown promising results in automating feature engineering, however, its effectiveness depends on the complexity and nature of the data being processed, as well as the specific problem being addressed.
 Typically it can work well with natural language processing and computer vision problems, while for standard tabular data, tree-based ensembles such as a random forest or gradient boosting are often still superior (and easier to handle). However, tabular deep learning approaches are currently catching up quickly.
 Hence, manual feature engineering is still often required but with `mlr3pipelines`, which can simplify the process as much as possible.
@@ -151,6 +151,10 @@ factor_pipeline =
     affect_columns = selector_type("factor"),
     id = "binary_enc")
 ```
+The order in which operations are performed matters here: `po("encodeimpact")` converts high-cardinality `factor` type features into `numeric` features, so they will not be affected by the `po("encode")` operators that come afterwards.
+Therefore, the one-hot encoding PipeOp does not need to specify *not* to affect high-cardinality features.
+Likewise, once the treatment encoding PipeOp sees the data, all non-binary `factor` features have been converted, so it will only affect binary factors by default.
+
 Now we can apply this pipeline to our xgboost model to use it in a benchmark experiment; we also compare a simpler pipeline that only uses one-hot encoding to demonstrate performance differences resulting from different strategies.
 
 ```{r preprocessing-013, message=FALSE}

From 14157a2c0df51c93c40792c6598677cf156724d2 Mon Sep 17 00:00:00 2001
From: mb706
Date: Thu, 30 May 2024 17:06:52 +0200
Subject: [PATCH 2/2] Update preprocessing.qmd

---
 book/chapters/chapter9/preprocessing.qmd | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/book/chapters/chapter9/preprocessing.qmd b/book/chapters/chapter9/preprocessing.qmd
index 00a00bd4a..95a7e1632 100644
--- a/book/chapters/chapter9/preprocessing.qmd
+++ b/book/chapters/chapter9/preprocessing.qmd
@@ -151,7 +151,7 @@ factor_pipeline =
     affect_columns = selector_type("factor"),
     id = "binary_enc")
 ```
-The order in which operations are performed matters here: `po("encodeimpact")` converts high-cardinality `factor` type features into `numeric` features, so they will not be affected by the `po("encode")` operators that come afterwards.
+The order in which operations are performed matters here: `po("encodeimpact")` converts high-cardinality `factor` type features into `numeric` features, so these will not be affected by the `po("encode")` operators that come afterwards.
 Therefore, the one-hot encoding PipeOp does not need to specify *not* to affect high-cardinality features.
 Likewise, once the treatment encoding PipeOp sees the data, all non-binary `factor` features have been converted, so it will only affect binary factors by default.
 
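The ordering effect described in the added paragraph can be checked interactively. The following is a minimal sketch, not part of the patches themselves: the toy data, the cardinality thresholds (10 and 2), and the task setup are assumptions, with the pipeline modeled on the `factor_pipeline` fragment visible in the hunk context.

```r
library(mlr3)
library(mlr3pipelines)

# Toy data with one high-cardinality factor, one low-cardinality factor,
# one binary factor, and a numeric target (all invented for illustration).
set.seed(1)
dat = data.frame(
  high_card = factor(sample(letters[1:20], 200, replace = TRUE)),  # ~20 levels
  low_card  = factor(sample(c("x", "y", "z"), 200, replace = TRUE)),  # 3 levels
  binary    = factor(sample(c("yes", "no"), 200, replace = TRUE)),  # 2 levels
  y         = rnorm(200)
)
task = as_task_regr(dat, target = "y", id = "toy")

# Impact encoding runs first; it turns `high_card` into a numeric column,
# so the later po("encode") steps never see it.
graph = po("encodeimpact",
    affect_columns = selector_cardinality_greater_than(10),
    id = "high_card_enc") %>>%
  po("encode", method = "one-hot",
    affect_columns = selector_cardinality_greater_than(2),
    id = "low_card_enc") %>>%
  po("encode", method = "treatment",
    affect_columns = selector_type("factor"),
    id = "binary_enc")

# By the time the treatment encoder runs, `binary` is the only factor left,
# so no explicit exclusions are needed anywhere in the pipeline.
out = graph$train(task)[[1]]
out$feature_types
```

After training, `out$feature_types` should list only numeric features: `high_card` was impact-encoded before the one-hot step ran, `low_card` was one-hot encoded, and `binary` was handled by the treatment encoder alone, which is exactly the ordering argument the patches add to the chapter.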