Add inner_split() methods for bootstrap #488
Merged
Unlike the inner_split() methods in #483, it's not straightforward what the splitting mechanism should be here.

Our main concern is data leakage:
The bootstrap sample in boot_split or group_boot_split likely contains several replications of an observation. We don't want those to be split up between the inner analysis set and the inner assessment set. Option 1 in the graph below is an example of that.

Options 2 and 3 in the graph try to avoid this by
group_boot_split, the original group combined with the row id). This would mean that rows in the inner assessment set are potentially not unique, unlike the typical bootstrap out-of-bag (OOB) sample.

Further thoughts:
With each bootstrap sample, we essentially put about 1/3 of the rows into the assessment set (the expected out-of-bag share approaches 1/e ≈ 0.368 as the number of rows grows). This could hurt us quickly, especially for small data. "Small data" is a problem for all other sampling procedures as well, but they usually have a dial to adjust that proportion.
People could abandon fidelity to the bootstrap idea here by specifying a different sampling procedure for the inner split in add_tailor() when making their workflow.
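
To make the leakage concern and the roughly 1/3 out-of-bag share concrete, here is a minimal, language-agnostic sketch in plain Python. It is not the rsample implementation and does not use its API. The helper names (`bootstrap_indices`, `naive_inner_split`, `id_level_inner_split`) are hypothetical, and the id-level split is only one possible reading of "splitting on the row id"; it is an assumption, not necessarily what options 2 or 3 do. As a simplification, the id-level inner analysis set keeps the replicate counts of the outer sample.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) with replacement."""
    return [rng.randrange(n) for _ in range(n)]

def naive_inner_split(outer_rows, rng):
    """Bootstrap the outer sample's rows directly, ignoring replicates."""
    picks = bootstrap_indices(len(outer_rows), rng)
    picked = set(picks)
    analysis = [outer_rows[i] for i in picks]
    # Rows at positions that were never drawn form the inner assessment set
    assessment = [row for i, row in enumerate(outer_rows) if i not in picked]
    return analysis, assessment

def id_level_inner_split(outer_rows, rng):
    """Bootstrap the distinct original row ids so replicates move together."""
    ids = sorted(set(outer_rows))
    picked_ids = {ids[i] for i in bootstrap_indices(len(ids), rng)}
    analysis = [row for row in outer_rows if row in picked_ids]
    assessment = [row for row in outer_rows if row not in picked_ids]
    return analysis, assessment

rng = random.Random(488)
n = 1000
outer = bootstrap_indices(n, rng)  # original row ids, many of them repeated

# Share of original rows that never entered the outer sample (out-of-bag)
oob_fraction = 1 - len(set(outer)) / n

a1, s1 = naive_inner_split(outer, rng)
naive_leak = set(a1) & set(s1)  # original ids present in both inner sets

a2, s2 = id_level_inner_split(outer, rng)
id_leak = set(a2) & set(s2)  # empty by construction
assessment_has_duplicates = len(s2) != len(set(s2))

print(f"out-of-bag share: {oob_fraction:.3f}")
print(f"ids in both inner sets (naive): {len(naive_leak)}")
print(f"ids in both inner sets (id-level): {len(id_leak)}")
print(f"id-level assessment set has duplicates: {assessment_has_duplicates}")
```

The naive split can place copies of the same original row on both sides of the inner split, which is the leakage described above. Splitting on distinct ids rules that out, at the cost of an inner assessment set that may contain repeated rows.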