From 0f681100cbfe8603b248fc2de6481ee208db0157 Mon Sep 17 00:00:00 2001
From: Joscelin Rocha Hidalgo
Date: Thu, 15 Aug 2024 13:29:40 -0700
Subject: [PATCH] Modify documentation to match tidy formatting

Fixed some typos and formatting issues throughout the vignettes and articles.
Backticks for package names were removed, addressing issue #218.
Unsure about the following uses of backticks:
- referring to objects and attributes
- referring to variable names
---
 README.Rmd                          |  2 +-
 vignettes/articles/workflowsets.Rmd | 38 +++++++++++++------------
 vignettes/basics.Rmd                | 44 ++++++++++++++---------------
 vignettes/classification.Rmd        | 17 ++++++-----
 4 files changed, 51 insertions(+), 50 deletions(-)

diff --git a/README.Rmd b/README.Rmd
index c39a2943..014cbf98 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -89,4 +89,4 @@ This project is released with a [Contributor Code of Conduct](https://github.com

 - Check out further details on [contributing guidelines for tidymodels packages](https://www.tidymodels.org/contribute/) and [how to get help](https://www.tidymodels.org/help/).

-In the stacks package, some test objects take too long to build with every commit. If your contribution changes the structure of `data_stack` or `model_stacks` objects, please regenerate these test objects by running the scripts in `man-roxygen/example_models.Rmd`, including those with chunk options `eval = FALSE`.
+In the stacks package, some test objects take too long to build with every commit. If your contribution changes the structure of data_stack or model_stacks objects, please regenerate these test objects by running the scripts in `man-roxygen/example_models.Rmd`, including those with chunk options `eval = FALSE`.
diff --git a/vignettes/articles/workflowsets.Rmd b/vignettes/articles/workflowsets.Rmd
index 1c72821f..9a0ac36d 100644
--- a/vignettes/articles/workflowsets.Rmd
+++ b/vignettes/articles/workflowsets.Rmd
@@ -73,7 +73,7 @@ In this example, we'll again make use of the `tree_frogs` data exported with `st

 Red-eyed tree frog (RETF) embryos can hatch earlier than their normal 7ish days if they detect potential predator threat. Researchers wanted to determine how, and when, these tree frog embryos were able to detect stimulus from their environment. To do so, they subjected the embryos at varying developmental stages to "predator stimulus" by jiggling the embryos with a blunt probe. Beforehand, though some of the embryos were treated with gentamicin, a compound that knocks out their lateral line (a sensory organ.) Researcher Julie Jung and her crew found that these factors inform whether an embryo hatches prematurely or not!

-We'll start out with predicting `latency` (i.e. time to hatch) based on other attributes. We'll need to filter out NAs (i.e. cases where the embryo did not hatch) first.
+We'll start out with predicting "latency" (i.e., time to hatch) based on other attributes. We'll need to filter out NAs (i.e., cases where the embryo did not hatch) first.
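The diff shows only the opening line of the chunk that follows; for orientation, the filtering step described above presumably looks something like this (a sketch, not necessarily the vignette's exact code):

```r
# drop the embryos that never hatched, i.e. rows where latency is NA
tree_frogs <- tree_frogs %>%
  filter(!is.na(latency))
```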
 ```{r, message = FALSE, warning = FALSE}
 data("tree_frogs")
@@ -108,12 +108,12 @@ First, splitting up the training data, generating resamples, and setting some op
 set.seed(1)
 tree_frogs_split <- initial_split(tree_frogs)
 tree_frogs_train <- training(tree_frogs_split)
-tree_frogs_test <- testing(tree_frogs_split)
+tree_frogs_test <- testing(tree_frogs_split)

 set.seed(1)
 folds <- rsample::vfold_cv(tree_frogs_train, v = 5)

-tree_frogs_rec <-
+tree_frogs_rec <-
   recipe(latency ~ ., data = tree_frogs_train)

 metric <- metric_set(rmse)
@@ -133,7 +133,7 @@ Starting out with K-nearest neighbors, we begin by creating a parsnip model spec
 # create a model specification
 knn_spec <-
   nearest_neighbor(
-    mode = "regression",
+    mode = "regression",
     neighbors = tune("k")
   ) %>%
   set_engine("kknn")
@@ -178,9 +178,9 @@ Finally, putting together the model specification and recipe for the support vec

 ```{r}
 # create a model specification
-svm_spec <-
+svm_spec <-
   svm_rbf(
-    cost = tune("cost"),
+    cost = tune("cost"),
     rbf_sigma = tune("sigma")
   ) %>%
   set_engine("kernlab") %>%
@@ -196,20 +196,20 @@ svm_rec <-
   step_normalize(all_numeric_predictors())
 ```

-With each model specification and accompanying recipe now defined, we can combine them via `workflow_set`:
+With each model specification and accompanying recipe now defined, we can combine them via `workflow_set()`:

 ```{r}
-wf_set <-
+wf_set <-
   workflow_set(
-    preproc = list(rec1 = knn_rec, rec2 = lin_reg_rec, rec3 = svm_rec),
-    models = list(knn = knn_spec, lin_reg = lin_reg_spec, svm = svm_spec),
+    preproc = list(rec1 = knn_rec, rec2 = lin_reg_rec, rec3 = svm_rec),
+    models = list(knn = knn_spec, lin_reg = lin_reg_spec, svm = svm_spec),
     cross = FALSE
   )

 wf_set
 ```

-Note that each combination of preprocessor and model specification is assigned a `wflow_id` that we can use to interface with individual model definitions:
+Note that each combination of preprocessor and model specification is assigned a wflow_id that we can use to interface with individual model definitions:

 ```{r}
 wf_set %>%
@@ -258,7 +258,7 @@ With these three model definitions fully specified and tuned in a workflow set,

 Building the stacked ensemble, now, takes even fewer lines than it did with individual workflows:

 ```{r, message = FALSE, warning = FALSE}
-tree_frogs_model_st <-
+tree_frogs_model_st <-
   # initialize the stack
   stacks() %>%
   # add candidate members
@@ -279,7 +279,7 @@ To make sure that we have the right trade-off between minimizing the number of m
 autoplot(tree_frogs_model_st)
 ```

-If these results were not good enough, `blend_predictions()` could be called again with different values of `penalty`. As it is, `blend_predictions()` picks the penalty parameter with the numerically optimal results. To see the top results:
+If these results were not good enough, `blend_predictions()` could be called again with different values of "penalty".
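For instance, a re-blend over a hand-picked grid might look roughly like this (a sketch: the grid values are illustrative, and `wf_set_tuned` is a hypothetical name standing in for the tuned workflow set, which this diff's context doesn't show):

```r
# re-run the blend over a custom penalty grid; blend_predictions()
# accepts a vector of candidate penalty values to choose among
stacks() %>%
  add_candidates(wf_set_tuned) %>%
  blend_predictions(penalty = 10^seq(-3, -0.5, length.out = 20)) %>%
  fit_members()
```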
+As it is, `blend_predictions()` picks the penalty parameter with the numerically optimal results. To see the top results:

 ```{r weight-plot, fig.alt = "A ggplot bar plot, giving the stacking coefficient on the x axis and member on the y axis. There are three members in this ensemble, where a nearest neighbor is weighted most heavily, followed by a linear regression with a stacking coefficient about half as large, followed by a support vector machine with a very small contribution."}
 autoplot(tree_frogs_model_st, type = "weights")
 ```
@@ -294,7 +294,7 @@ collect_parameters(tree_frogs_model_st, "rec3_svm")

 This object is now ready to predict with new data!

 ```{r}
-tree_frogs_test <-
+tree_frogs_test <-
   tree_frogs_test %>%
   bind_cols(predict(tree_frogs_model_st, .))
 ```

 Juxtaposing the predictions with the true data:

 ```{r, fig.alt = "A ggplot scatterplot showing observed versus predicted latency values. While there is indeed a positive and roughly linear relationship, there is certainly patterned structure in the residuals."}
 ggplot(tree_frogs_test) +
-  aes(x = latency,
-      y = .pred) +
-  geom_point() +
+  aes(
+    x = latency,
+    y = .pred
+  ) +
+  geom_point() +
   coord_obs_pred()
 ```

 Looks like our predictions were decent! How do the stacks predictions perform, though, as compared to the members' predictions? We can use the `type = "members"` argument to generate predictions from each of the ensemble members.

 ```{r}
-member_preds <-
+member_preds <-
   tree_frogs_test %>%
   select(latency) %>%
   bind_cols(predict(tree_frogs_model_st, tree_frogs_test, members = TRUE))
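The article's closing comparison falls outside this diff's context; schematically, it can be computed along these lines (a sketch; `rmse_vec()` is from yardstick, and purrr and dplyr load with tidymodels):

```r
# RMSE for the ensemble and for each member, against the true latency
member_preds %>%
  select(-latency) %>%
  map_dbl(rmse_vec, truth = member_preds$latency)
```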
diff --git a/vignettes/basics.Rmd b/vignettes/basics.Rmd
index 331948da..a7771947 100644
--- a/vignettes/basics.Rmd
+++ b/vignettes/basics.Rmd
@@ -16,12 +16,12 @@ knitr::opts_chunk$set(

 In this article, we'll be working through an example of the workflow of model stacking with the stacks package. At a high level, the workflow looks something like this:

-1. Define candidate ensemble members using functionality from rsample, parsnip, workflows, recipes, and tune
-2. Initialize a `data_stack` object with `stacks()`
-3. Iteratively add candidate ensemble members to the `data_stack` with `add_candidates()`
-4. Evaluate how to combine their predictions with `blend_predictions()`
-5. Fit candidate ensemble members with non-zero stacking coefficients with `fit_members()`
-6. Predict on new data with `predict()`!
+1. Define candidate ensemble members using functionality from rsample, parsnip, workflows, recipes, and tune
+2. Initialize a data_stack object with `stacks()`
+3. Iteratively add candidate ensemble members to the data_stack with `add_candidates()`
+4. Evaluate how to combine their predictions with `blend_predictions()`
+5. Fit candidate ensemble members with non-zero stacking coefficients with `fit_members()`
+6. Predict on new data with `predict()`!

 The package is closely integrated with the rest of the functionality in tidymodels—we'll load those packages as well, in addition to some tidyverse packages to evaluate our results later on.

@@ -57,11 +57,11 @@ knitr::opts_chunk$set(
 )
 ```

-In this example, we'll make use of the `tree_frogs` data exported with `stacks`, giving experimental results on hatching behavior of red-eyed tree frog embryos!
+In this example, we'll make use of the `tree_frogs` data exported with stacks, giving experimental results on hatching behavior of red-eyed tree frog embryos!

 Red-eyed tree frog (RETF) embryos can hatch earlier than their normal 7ish days if they detect potential predator threat. Researchers wanted to determine how, and when, these tree frog embryos were able to detect stimulus from their environment. To do so, they subjected the embryos at varying developmental stages to "predator stimulus" by jiggling the embryos with a blunt probe. Beforehand, though some of the embryos were treated with gentamicin, a compound that knocks out their lateral line (a sensory organ.) Researcher Julie Jung and her crew found that these factors inform whether an embryo hatches prematurely or not!

-We'll start out with predicting `latency` (i.e. time to hatch) based on other attributes. We'll need to filter out NAs (i.e. cases where the embryo did not hatch) first.
+We'll start out with predicting "latency" (i.e., time to hatch) based on other attributes. We'll need to filter out NAs (i.e., cases where the embryo did not hatch) first.

 ```{r, message = FALSE, warning = FALSE}
 data("tree_frogs")
@@ -82,20 +82,21 @@ ggplot(tree_frogs) +
   geom_point() +
   labs(x = "Embryo Age (s)", y = "Time to Hatch (s)", col = "Treatment")
 ```
+
 Let's give this a go!

 # Define candidate ensemble members

-At the highest level, ensembles are formed from _model definitions_. In this package, model definitions are an instance of a minimal [`workflow`](https://workflows.tidymodels.org/), containing a _model specification_ (as defined in the [`parsnip`](https://parsnip.tidymodels.org/) package) and, optionally, a _preprocessor_ (as defined in the [`recipes`](https://recipes.tidymodels.org/) package). Model definitions specify the form of candidate ensemble members.
+At the highest level, ensembles are formed from *model definitions*. In this package, model definitions are an instance of a minimal [`workflow`](https://workflows.tidymodels.org/), containing a *model specification* (as defined in the [`parsnip`](https://parsnip.tidymodels.org/) package) and, optionally, a *preprocessor* (as defined in the [`recipes`](https://recipes.tidymodels.org/) package). Model definitions specify the form of candidate ensemble members.

 ```{r, echo = FALSE, fig.alt = "A diagram representing 'model definitions,' which specify the form of candidate ensemble members. Three colored boxes represent three different model types; a K-nearest neighbors model (in salmon), a linear regression model (in yellow), and a support vector machine model (in green)."}
 knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/model_defs.png")
 ```

-Defining the constituent model definitions is undoubtedly the longest part of building an ensemble with `stacks`. If you're familiar with tidymodels "proper," you're probably fine to skip this section, keeping a few things in mind:
+Defining the constituent model definitions is undoubtedly the longest part of building an ensemble with stacks. If you're familiar with tidymodels "proper," you're probably fine to skip this section, keeping a few things in mind:

-* You'll need to save the assessment set predictions and workflow utilized in your `tune_grid()`, `tune_bayes()`, or `fit_resamples()` objects by setting the `control` arguments `save_pred = TRUE` and `save_workflow = TRUE`. Note the use of the `control_stack_*()` convenience functions below!
-* Each model definition must share the same rsample `rset` object.
+- You'll need to save the assessment set predictions and workflow utilized in your `tune_grid()`, `tune_bayes()`, or `fit_resamples()` objects by setting the control arguments `save_pred = TRUE` and `save_workflow = TRUE`. Note the use of the `control_stack_*()` convenience functions below!
+- Each model definition must share the same rsample rset object.
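As a gloss on the first bullet, both calls below should yield control objects suitable for stacking (a sketch):

```r
# set the two flags by hand on a standard tune control object...
ctrl <- tune::control_grid(save_pred = TRUE, save_workflow = TRUE)
# ...or let the stacks convenience wrapper set them for you
ctrl <- stacks::control_stack_grid()
```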

 We'll first start out with splitting up the training data, generating resamples, and setting some options that will be used by each model definition.

@@ -115,7 +116,7 @@ tree_frogs_rec <-

 metric <- metric_set(rmse)
 ```

-Tuning and fitting results for use in ensembles need to be fitted with the control arguments `save_pred = TRUE` and `save_workflow = TRUE`—these settings ensure that the assessment set predictions, as well as the workflow used to fit the resamples, are stored in the resulting object. For convenience, stacks supplies some `control_stack_*()` functions to generate the appropriate objects for you.
+Tuning and fitting results for use in ensembles need to be fitted with the control arguments `save_pred = TRUE` and `save_workflow = TRUE`—these settings ensure that the assessment set predictions, as well as the workflow used to fit the resamples, are stored in the resulting object. For convenience, stacks supplies some `control_stack_*()` functions to generate the appropriate objects for you.

 In this example, we'll be working with `tune_grid()` and `fit_resamples()` from the tune package, so we will use the following control settings:

@@ -126,7 +127,7 @@ ctrl_res <- control_stack_resamples()
 ```

 We'll define three different model definitions to try to predict time to hatch—a K-nearest neighbors model (with hyperparameters to tune), a linear model, and a support vector machine model (again, with hyperparameters to tune).

-Starting out with K-nearest neighbors, we begin by creating a `parsnip` model specification:
+Starting out with K-nearest neighbors, we begin by creating a parsnip model specification:

 ```{r}
 # create a model definition
@@ -187,7 +188,7 @@ knn_res <-

 knn_res
 ```

-This `knn_res` object fully specifies the candidate members, and is ready to be included in a `stacks` workflow.
+This knn_res object fully specifies the candidate members and is ready to be included in a stacks workflow.

 Now, specifying the linear model, note that we are not optimizing over any hyperparameters. Thus, we use the `fit_resamples()` function rather than `tune_grid()` or `tune_bayes()` when fitting to our resamples.

@@ -269,11 +270,11 @@ Altogether, we've created three model definitions, where the K-nearest neighbors
 knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/candidates.png")
 ```

-With these three model definitions fully specified, we are ready to begin stacking these model configurations. (Note that, in most applied settings, one would likely specify many more than 11 candidate members.)
+With these three model definitions fully specified, we are ready to begin stacking these model configurations. Note that, in most applied settings, one would likely specify many more than 11 candidate members.

 # Putting together a stack

-The first step to building an ensemble with stacks is to create a `data_stack` object—in this package, data stacks are tibbles (with some extra attributes) that contain the assessment set predictions for each candidate ensemble member.
+The first step to building an ensemble with stacks is to create a data_stack object—in this package, data stacks are tibbles (with some extra attributes) that contain the assessment set predictions for each candidate ensemble member.
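Concretely, building one up looks like the sketch below (`knn_res` appears above; `lin_reg_res` and `svm_res` are the names the vignette presumably uses for the linear-model and SVM results, which sit outside this diff's context):

```r
# initialize the data stack, then add each set of candidates in turn;
# the result prints like a tibble of assessment-set predictions
tree_frogs_data_st <-
  stacks() %>%
  add_candidates(knn_res) %>%
  add_candidates(lin_reg_res) %>%
  add_candidates(svm_res)
```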

 ```{r, echo = FALSE, fig.alt = "A diagram representing a 'data stack,' a specific kind of data frame. Colored 'columns' depict, in white, the true value of the outcome variable in the validation set, followed by four columns (in salmon) representing the predictions from the K-nearest neighbors model, one column (in tan) representing the linear regression model, and six (in green) representing the support vector machine model."}
 knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/data_stack.png")
 ```

@@ -319,25 +320,25 @@ tree_frogs_model_st <-
   blend_predictions()
 ```

-The `blend_predictions` function determines how member model output will ultimately be combined in the final prediction by fitting a LASSO model on the data stack, predicting the true assessment set outcome using the predictions from each of the candidate members. Candidates with nonzero stacking coefficients become members.
+The `blend_predictions()` function determines how member model output will ultimately be combined in the final prediction by fitting a LASSO model on the data stack, predicting the true assessment set outcome using the predictions from each of the candidate members. Candidates with nonzero stacking coefficients become members.
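Schematically, the blended prediction is a regularized linear combination of the candidate predictions $\hat{f}_k(x)$:

$$\hat{y} = \beta_0 + \sum_k \beta_k \, \hat{f}_k(x)$$

where the LASSO penalty shrinks most stacking coefficients $\beta_k$ to exactly zero; the candidates whose coefficients survive become the members.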

 ```{r, echo = FALSE, fig.alt = "A diagram representing 'stacking coefficients,' the coefficients of the linear model combining each of the candidate member predictions to generate the ensemble's ultimate prediction. Boxes for each of the candidate members are placed besides each other, filled in with color if the coefficient for the associated candidate member is nonzero."}
 knitr::include_graphics("https://raw.githubusercontent.com/tidymodels/stacks/main/man/figures/coefs.png")
 ```

-To make sure that we have the right trade-off between minimizing the number of members and optimizing performance, we can use the `autoplot()` method:
+To make sure that we have the right trade-off between minimizing the number of members and optimizing performance, we can use the `autoplot()` method:

 ```{r penalty-plot}
 autoplot(tree_frogs_model_st)
 ```

-To show the relationship more directly:
+To show the relationship more directly:

 ```{r members-plot, fig.alt = "A ggplot line plot. The x axis shows the degree of penalization, ranging from 1e-06 to 1e-01, and the y axis displays the mean of three different metrics. The plots are faceted by metric type, with three facets: number of members, root mean squared error, and R squared. The plots generally show that, as penalization increases, the error decreases. There are very few proposed members in this example, so penalization doesn't drive down the number of members much at all. In this case, then, a larger penalty is acceptable."}
 autoplot(tree_frogs_model_st, type = "members")
 ```

-If these results were not good enough, `blend_predictions()` could be called again with different values of `penalty`. As it is, `blend_predictions()` picks the penalty parameter with the numerically optimal results. To see the top results:
+If these results were not good enough, `blend_predictions()` could be called again with different values of "penalty". As it is, `blend_predictions()` picks the penalty parameter with the numerically optimal results. To see the top results:

 ```{r weight-plot, fig.alt = "A ggplot bar plot, giving the stacking coefficient on the x axis and member on the y axis. There are three members in this ensemble, where a nearest neighbor is weighted most heavily, followed by a linear regression with a stacking coefficient about half as large, followed by a support vector machine with a very small contribution."}
 autoplot(tree_frogs_model_st, type = "weights")
 ```

@@ -410,4 +411,3 @@ map(member_preds, rmse_vec, truth = member_preds$latency) %>%

 As we can see, the stacked ensemble outperforms each of the member models, though is closely followed by one of its members.

 Voila! You've now made use of the stacks package to predict red-eyed tree frog embryo hatching using a stacked ensemble! The full visual outline for these steps can be found [here](https://github.com/tidymodels/stacks/blob/main/inst/figs/outline.png).
-
diff --git a/vignettes/classification.Rmd b/vignettes/classification.Rmd
index 6bd28125..bd5f8345 100644
--- a/vignettes/classification.Rmd
+++ b/vignettes/classification.Rmd
@@ -50,12 +50,11 @@ knitr::opts_chunk$set(
 )
 ```

-
-In this example, we'll make use of the `tree_frogs` data exported with `stacks`, giving experimental results on hatching behavior of red-eyed tree frog embryos!
+In this example, we'll make use of the `tree_frogs` data exported with stacks, giving experimental results on hatching behavior of red-eyed tree frog embryos!

 Red-eyed tree frog (RETF) embryos can hatch earlier than their normal 7ish days if they detect potential predator threat. Researchers wanted to determine how, and when, these tree frog embryos were able to detect stimulus from their environment. To do so, they subjected the embryos at varying developmental stages to "predator stimulus" by jiggling the embryos with a blunt probe. Beforehand, though, some of the embryos were treated with gentamicin, a compound that knocks out their lateral line (a sensory organ). Researcher Julie Jung and her crew found that these factors inform whether an embryo hatches prematurely or not!

-In this article, we'll use most all of the variables in `tree_frogs` to predict `reflex`, a measure of ear function called the vestibulo-ocular reflex (VOR), categorized into bins. Ear function increases from factor levels "low", to "mid", to "full".
+In this article, we'll use almost all of the variables in `tree_frogs` to predict "reflex", a measure of ear function called the vestibulo-ocular reflex (VOR), categorized into bins. Ear function increases from factor levels "low", to "mid", to "full".

 ```{r, message = FALSE, warning = FALSE}
 data("tree_frogs")
@@ -110,7 +109,7 @@ We also need to use the same control settings as in the numeric response setting

 ctrl_grid <- control_stack_grid()
 ```

-We'll define two different model definitions to try to predict `reflex`—a random forest and a neural network.
+We'll define two different model definitions to try to predict "reflex"—a random forest and a neural network.

 Starting out with a random forest:
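The specification chunk itself sits outside this diff's context; as a sketch (the hyperparameter choices here are illustrative), it would take roughly this shape:

```r
# a random forest specification with tunable hyperparameters,
# set up for classification
rand_forest_spec <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")
```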

@@ -184,25 +183,25 @@ tree_frogs_model_st <-
 tree_frogs_model_st
 ```

-To make sure that we have the right trade-off between minimizing the number of members and optimizing performance, we can use the `autoplot()` method:
+To make sure that we have the right trade-off between minimizing the number of members and optimizing performance, we can use the `autoplot()` method:

 ```{r penalty-plot, fig.alt = "A ggplot line plot. The x axis shows the degree of penalization, ranging from 1e-06 to 1e-01, and the y axis displays the mean of three different metrics. The plots are faceted by metric type, with three facets: accuracy, number of members, and ROC AUC. The plots generally show that, as penalization increases, the error increases, though fewer members are included in the model. A dashed line at a penalty of 1e-05 indicates that the stack has chosen a smaller degree of penalization."}
 autoplot(tree_frogs_model_st)
 ```

-To show the relationship more directly:
+To show the relationship more directly:

 ```{r members-plot, fig.alt = "A similarly formatted ggplot line plot, showing that greater numbers of members result in higher accuracy."}
 autoplot(tree_frogs_model_st, type = "members")
 ```

-If these results were not good enough, `blend_predictions()` could be called again with different values of `penalty`. As it is, `blend_predictions()` picks the penalty parameter with the numerically optimal results. To see the top results:
+If these results were not good enough, `blend_predictions()` could be called again with different values of "penalty". As it is, `blend_predictions()` picks the penalty parameter with the numerically optimal results. To see the top results:

 ```{r weight-plot, fig.alt = "A ggplot bar plot, giving the stacking coefficient on the x axis and member on the y axis. Bars corresponding to neural networks are shown in red, while random forest bars are shown in blue. Generally, the neural network tends to accentuate features of the 'low' response, while the random forest does so for the 'mid' response."}
 autoplot(tree_frogs_model_st, type = "weights")
 ```

-There are multiple facets since the ensemble members can have different effects on different classes.
+There are multiple facets since the ensemble members can have different effects on different classes.

 To identify which model configurations were assigned what stacking coefficients, we can make use of the `collect_parameters()` function:

@@ -254,4 +253,4 @@ map(
 pivot_longer(c(everything(), -reflex))
 ```

-Voilà! You've now made use of the stacks package to predict tree frog embryo ear function using a stacked ensemble!
+Voilà! You've now made use of the stacks package to predict tree frog embryo ear function using a stacked ensemble!