We're making a substantial change to the workhorse of the package. Currently, the computing path is:

- `tune_grid_loop()` is the top-level call to compute. After computations, the individual results (e.g., metrics, predictions, extracts) are partitioned out of the results object.
- `tune_grid_loop_tune()` (re)sets `parallel_over` based on the number of resamples, then calls the iterator.
- `tune_grid_loop_iter()` is the script that works through the conditional execution process, from preprocessing to model prediction.
## Plan for the new approach
We create a schedule of computations (#974 and #978) for a grid of tuning parameters. This defines the conditional execution that loops over preprocessors, models, and now postprocessors. It also accounts for speed-ups achieved via submodel parameters.
Given the grid and a specific resample `rsplit` object (which defines the analysis and assessment data), we can run a function to create the schedule and execute it (its current pet name is `loopy()`). This will return all of our default results and any optional results (e.g., predictions and extracts).
Let's say we have `B` resamples and `S` grid points. We can call `loopy()` in a few different ways. Currently, tune defaults to a loop of `B` iterations, each processing the `S` grid points. However, one option (controlled by `parallel_over`; see this section of TMwR) "flattens" the loop so that all `B * S` tasks can be run in parallel.
We can choose which path to take using this pseudocode:
```r
# `splits` is a list of B rsplit objects
# `grid` is the data frame of S candidates (in rows)
# `grid_rows` is `grid` decomposed into a list of S 1-point grid subsets

if (parallel_over == "resamples") {
  # The default
  # Loop over B splits, process the whole grid of S candidates
  res <- map(splits, ~ loopy(.x, grid))
} else {
  # Do all at once, either because preprocessing is cheap or a validation
  # set is being used.
  # Make a list of all combinations of indices for splits and candidates.
  indices <- crossing(s = seq_along(grid_rows), b = seq_along(splits))
  indices <- vec_split(indices, by = 1:nrow(indices))
  res <- map(indices, ~ loopy(splits[[.x$b]], grid_rows[[.x$s]]))
}
```
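To make the two code paths concrete, here is a minimal Python sketch (Python rather than R, just so the logic is easy to test in isolation). `loopy()` is a stub and `schedule()` is an invented name for illustration; neither is part of tune.

```python
# Illustrative sketch only: a stub loopy() and an invented schedule()
# showing the B-task vs. B*S-task strategies described above.
from itertools import product

def loopy(split, grid):
    # Stand-in for the real worker: one task per call.
    return {"split": split, "n_candidates": len(grid)}

def schedule(splits, grid_rows, parallel_over="resamples"):
    if parallel_over == "resamples":
        # The default: B tasks, each processing the whole grid of S candidates.
        full_grid = [row for rows in grid_rows for row in rows]
        return [loopy(s, full_grid) for s in splits]
    # "Flattened": B * S tasks, one per (resample, sub-grid) combination.
    return [loopy(s, rows) for s, rows in product(splits, grid_rows)]

splits = ["fold1", "fold2", "fold3"]                  # B = 3 resamples
grid_rows = [[{"penalty": 0.1}], [{"penalty": 1.0}]]  # S = 2 one-point grids

assert len(schedule(splits, grid_rows, "resamples")) == 3   # B tasks
assert len(schedule(splits, grid_rows, "everything")) == 6  # B * S tasks
```

The trade-off is the one described above: the flattened path maximizes the number of parallelizable tasks, at the cost of repeating preprocessing per task.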
We'll probably map using `future.apply::future_lapply()`.
We ~~hope~~ think that, in the new code path, `tune_grid_loop()` will set the data options and other options, execute the pseudocode above, and parse the results into different components.
## Special cases and notes
We will keep Simon's logging method to catalog messages, warnings, and errors methodically.
We must branch inside `loopy()` to handle h2o processing via the agua package.
We will not have a dependency on foreach anymore. 😿
The pattern used in `.config` is currently `"Preprocessor{X}Model{X}"`. We'll change this to `"pre{X}_mod{X}_post{X}"`, where `{X}` is padded with zeros, or is just zero when there are no pre- or postprocessing tuning parameters.
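As a quick illustration of labels of that shape, here is a hedged Python sketch; `config_label()` and its padding rule are assumptions for illustration, not tune's actual code:

```python
# Hypothetical sketch of "pre{X}_mod{X}_post{X}" labels. The helper name
# and padding rule are assumptions, not tune's implementation.
def config_label(pre, mod, post, n_pre=1, n_mod=1, n_post=1):
    def pad(i, n):
        # Zero-pad to the width needed for n values; plain "0" when that
        # stage has no tuning parameters (n == 0).
        return "0" if n == 0 else str(i).zfill(len(str(n)))
    return f"pre{pad(pre, n_pre)}_mod{pad(mod, n_mod)}_post{pad(post, n_post)}"

print(config_label(2, 7, 0, n_pre=12, n_mod=100, n_post=0))
# → pre02_mod007_post0
```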
In the pseudocode above, `grid_rows` is a little more complex than it appears. Instead of S 1-point grids, its elements can have multiple rows when a submodel parameter is being tuned. To do this, we emulate `min_grid` and group the grid candidates into unique combinations of all non-submodel parameters. For example, suppose a regular grid is made with 3 levels for each of 3 parameters (2 non-submodel and 1 submodel parameter). The regular `grid` will have 27 rows, but `grid_rows` will be a list of 9 grids, each having three rows. Keeping the submodels together in each sub-grid will allow the schedule and `loopy()` to gain their submodel speed-ups.
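The grouping described above can be sketched as follows. This is a Python illustration of the idea, not `min_grid` itself; the parameter names (`mtry`, `min_n`, and the submodel parameter `trees`) are just examples:

```python
# A 3x3x3 regular grid: two non-submodel parameters (mtry, min_n) plus one
# submodel parameter (trees). Grouping by the non-submodel parameters
# collapses 27 candidates into 9 sub-grids of 3 rows each.
from itertools import product

grid = [
    {"mtry": m, "min_n": n, "trees": t}
    for m, n, t in product([2, 4, 6], [5, 10, 20], [100, 500, 1000])
]

def group_by_non_submodel(grid, submodel="trees"):
    groups = {}
    for row in grid:
        key = tuple((k, v) for k, v in sorted(row.items()) if k != submodel)
        groups.setdefault(key, []).append(row)
    return list(groups.values())

grid_rows = group_by_non_submodel(grid)
assert len(grid) == 27
assert len(grid_rows) == 9 and all(len(g) == 3 for g in grid_rows)
```

Each sub-grid shares one model fit across its three `trees` values, which is where the submodel speed-up comes from.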