Restructuring grid search processing #980

Open
topepo opened this issue Feb 15, 2025 · 1 comment

Comments

@topepo
Member

topepo commented Feb 15, 2025

We're making a substantial change to the workhorse of the package for two reasons:

  • a complete conversion to the future package, and
  • efficient tuning of postprocessing parameters.

Currently, the computing path is:

tune_grid.* -> tune_grid_workflow
└─ tune_grid_workflow -> tune_grid_loop
    └─ tune_grid_loop -> tune_grid_loop_tune (aka fn_tune_grid_loop)
         └─ tune_grid_loop_tune -> tune_grid_loop_iter (aka fn_tune_grid_loop_iter)

where

  • tune_grid_loop() is the top-level call to compute. After computations, the individual results (e.g., metrics, predictions, extracts, etc.) are partitioned out of the results object.

  • tune_grid_loop_tune() (re)sets parallel_over based on the number of resamples, then calls the iterator.

  • tune_grid_loop_iter() is the script that goes through the conditional execution process from preprocessing to model prediction.

Plan for the new approach

We create a schedule of computations (#974 and #978) for a grid of tuning parameters. This defines the conditional execution that loops over preprocessors, models, and now postprocessors. It also accounts for speed-ups achieved via submodel parameters.

Given the grid and a specific resample rsplit object (that defines the analysis and assessment data), we can run a function to create the schedule and execute it (current pet name is "loopy()"). This will return all of our default results and any optional results (e.g. predictions and extracts).
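As a rough illustration only, a minimal sketch of what that interface could look like is below. The function name is the current working name from above, but the arguments and the list-columns in the return value are assumptions, not a final API.

# A minimal sketch of the intended interface, not the final implementation.
# The arguments and returned list-columns are assumptions for illustration.
loopy <- function(split, grid, workflow, metrics, control) {
  # The rsplit object defines the analysis and assessment data
  analysis_data   <- rsample::analysis(split)
  assessment_data <- rsample::assessment(split)

  # ... build the schedule for `grid` (#974, #978), then loop over
  # preprocessors, models, and postprocessors, predicting on the
  # assessment data ...

  # Return the default results plus any optional ones as list-columns
  tibble::tibble(
    .metrics     = list(NULL),  # default results
    .predictions = list(NULL),  # optional (e.g., control_grid(save_pred = TRUE))
    .extracts    = list(NULL),  # optional extracts
    .notes       = list(NULL)   # logged messages/warnings/errors
  )
}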

Let's say we have B resamples and S grid points. We can call loopy() in a few different ways. Currently, tune defaults to a loop of B iterations, each processing the S grid points. However, one option (controlled by parallel_over, see this section of TMwR) "flattens" the loop so that all B*S tasks can be run in parallel.

We can choose which path to take using this pseudocode:

# `splits` is a list of B rsplit objects
# `grid` is the data frame of S candidates (in rows)
# `grid_rows` is `grid` decomposed into a list of S 1-point grid subsets

if (parallel_over == "resamples") {
  # The default: loop over the B splits, processing the whole grid of S
  # candidates within each split.
  res <- purrr::map(splits, ~ loopy(.x, grid))
} else {
  # Do everything at once, either because preprocessing is cheap or because a
  # validation set is being used.

  # Make a data frame of all combinations of split and candidate indices, then
  # break it into a list with one element per (split, candidate) pair.
  indices <- tidyr::crossing(s = seq_along(grid_rows), b = seq_along(splits))
  indices <- vctrs::vec_split(indices, seq_len(nrow(indices)))$val

  res <- purrr::map(indices, ~ loopy(splits[[.x$b]], grid_rows[[.x$s]]))
}

We'll probably map using future.apply::future_lapply().
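For instance, the default "resamples" branch of the pseudocode could look roughly like the following. This is a sketch only; the plan() shown is user-controlled and multisession is just an example backend.

# Sketch of the default branch using future.apply; the parallel backend is
# whatever the user registers with future::plan().
library(future.apply)   # also attaches the future package

plan(multisession)      # example backend; users choose their own plan

res <- future_lapply(
  splits,
  function(split) loopy(split, grid),
  future.seed = TRUE    # sound, reproducible parallel RNG
)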

We think that the new code path will be:

tune_grid.* -> tune_grid_workflow
└─ tune_grid_workflow -> tune_grid_loop

tune_grid_loop() will set the data and other options, execute the pseudocode above, and parse the results into their different components.

Special cases and notes

  • We will keep Simon's logging method to catalog messages, warnings, and errors methodically.

  • We must branch inside loopy() to handle h2o processing via the agua package.

  • We will not have a dependency on foreach anymore. 😿

  • The pattern used in .config is currently "Preprocessor{X}Model{X}". We'll change this to "pre{X}_mod{X}_post{X}", where "{X}" is zero-padded, or is just zero when there are no pre- or postprocessing tuning parameters (a small labeling sketch appears after this list).

  • In the pseudocode above, grid_rows is a little more complex than it appears. Instead of S 1-point grids, its elements can have multiple rows when a submodel parameter is being tuned. To do this, we emulate min_grid() and group the grid candidates into unique combinations of all non-submodel parameters. For example, suppose a regular grid is made with 3 levels for each of 3 parameters (2 non-submodel parameters and a single submodel parameter). The regular grid will have 27 rows, but grid_rows will be a list of 9 grids, each having three rows. Keeping the submodels together in each sub-grid allows the schedule and loopy() to gain their submodel speed-ups (see the sketch below).
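
To make the 27-row example concrete, here is a sketch of the decomposition. The parameter names (mtry, min_n, trees) are only illustrative, and the real code will use min_grid()-style logic rather than a plain group_split():

library(dplyr)
library(tidyr)

# 3 levels each: two non-submodel parameters and one submodel parameter
grid <- crossing(
  mtry  = c(2, 5, 8),         # non-submodel
  min_n = c(5, 10, 20),       # non-submodel
  trees = c(500, 1000, 1500)  # submodel parameter
)
nrow(grid)  # 27

# Group by the non-submodel parameters; each sub-grid keeps its 3 submodels
grid_rows <- grid |>
  group_by(mtry, min_n) |>
  group_split()

length(grid_rows)     # 9
nrow(grid_rows[[1]])  # 3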

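A small sketch of the proposed labels is below; make_config() is a hypothetical helper, and the padding widths are assumptions about the behavior described in the list above.

# Hypothetical helper for the new .config pattern; padding widths are guesses.
make_config <- function(pre, mod, post, n_pre = 1, n_mod = 1, n_post = 0) {
  pad <- function(i, n) {
    # Use "0" when there is nothing to tune at that stage; otherwise zero-pad
    if (n == 0) "0" else formatC(i, width = nchar(n), flag = "0")
  }
  paste0("pre", pad(pre, n_pre), "_mod", pad(mod, n_mod), "_post", pad(post, n_post))
}

make_config(1, 3, 0, n_pre = 2, n_mod = 27, n_post = 0)
#> [1] "pre1_mod03_post0"
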
@topepo
Member Author

topepo commented Feb 16, 2025

I'm using a temporary repo (https://github.com/topepo/melodie), a trimmed-down version of tune, to experiment with.
