# `get_test_data()` does not consider lagged differences of lagged differences or lags of lags (#359)
## Comments
Potential workaround in some cases: add a […]

---

Updated this to describe an approach tracking shift sets rather than shift ranges. This could be useful in the context of: […]
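
To make the shift-set vs. shift-range distinction concrete (illustration only, invented here rather than taken from the thread): a column built from lags 7 and 14 of a signal depends on exactly the offsets {-7, -14}, while a shift *range* `[-14, -7]` would also implicate all the days in between.

```r
# Illustration only: a shift set pinpoints exactly which offsets of the
# original signal must be nonmissing, which matters when the data has gaps.
shifts <- c(-7L, -14L)   # shift set for "lag 7 and lag 14 of a signal"
range(shifts)            # the coarser shift range: -14 .. -7

latest <- as.Date("2021-12-31")   # e.g., the latest test time value
latest + shifts                   # the only dates that actually must exist
#> [1] "2021-12-24" "2021-12-17"
```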

---

@rnayebi21 also encountered an issue when manually calculating lags 7 and 14 of a signal, then using […]

---

This workaround didn't end up working. I tried lagging on 0 variables and got "object 'value' not found".

---

Current workaround that has worked for me is to use the following instead: […] Also for context, the error occurs when I'm using an ahead larger than 24, but only occurs in the […]

---

In terms of the workaround, the following should work (updated to make it slightly more similar to your case):

```r
library(epipredict)
library(dplyr)

jhu <- case_death_rate_subset %>%
  filter(time_value >= "2021-01-01", geo_value %in% c("ca", "ny"))

r <- epi_recipe(jhu) %>%
  step_epi_lag(case_rate, lag = 7L) %>%
  step_lag_difference(lag_7_case_rate, horizon = 7, prefix = "one") %>%
  step_lag_difference(starts_with("one"), horizon = 7, prefix = "two") %>%
  step_epi_lag(case_rate, lag = 21, prefix = "rm") %>%
  step_epi_ahead(death_rate, ahead = 7) %>%
  recipes::step_rm(starts_with("rm"))

frost <- frosting() %>% layer_naomit(.pred)
wf <- epi_workflow(r, linear_reg(), frost)
fitted <- fit(wf, jhu)

forecast(fitted) %>% filter(time_value == max(jhu$time_value))
#> An `epi_df` object, 2 x 3 with metadata:
#> * geo_type  = state
#> * time_type = day
#> * as_of     = 2022-05-31 15:08:25.791826
#>
#> # A tibble: 2 × 3
#>   geo_value time_value .pred
#> * <chr>     <date>     <dbl>
#> 1 ca        2021-12-31  0.117
#> 2 ny        2021-12-31  0.192
```

(The dummy `step_epi_lag(case_rate, lag = 21, prefix = "rm")` appears to be the key trick: it forces `get_test_data()` to retain enough history for the chained differences, and the resulting columns are then discarded with `step_rm()`.)

---

For the original problem, I think we were looking at a role-selection or names-based workaround similar to the above. The roles-based approach failed due to the tidy selector not working, and the names-based one I'm guessing ran into the lack of […] [mentioned above already]. But @rnayebi21 is trying out another alternative: use […]
## Original issue

@rnayebi21 was trying to calculate second differences with `step_lag_difference()` + another `step_lag_difference()` on the generated output. But `get_test_data()`'s horizon processing assumes that these are both calculated from "original" signals (and maybe also that, for each epikey, these signals are both nonmissing at the latest time value available for that epikey). The result is too short a time window for predictions. E.g., lagged differencing with `horizon = 7` followed by another `horizon = 7` will make `get_test_data()` filter to around 8 days, but we actually need around 15 days. Additionally, the eventual output error message appears to be deeply nested and unhelpful, from `stopifnot(length(values) == length(quantile_levels))`.
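
For concreteness, here is a minimal sketch of the failing pattern as I understand it (modeled on the workaround snippet above but without the dummy-lag trick; the prefixes are illustrative and I haven't run this exact snippet):

```r
library(epipredict)
library(dplyr)

jhu <- case_death_rate_subset %>%
  filter(time_value >= "2021-01-01", geo_value %in% c("ca", "ny"))

r <- epi_recipe(jhu) %>%
  # first difference: depends on shifts {0, -7} of case_rate
  step_lag_difference(case_rate, horizon = 7, prefix = "one") %>%
  # second difference of the generated column: depends on shifts {0, -7, -14},
  # but get_test_data() only budgets history for the outer horizon = 7
  step_lag_difference(starts_with("one"), horizon = 7, prefix = "two") %>%
  step_epi_ahead(death_rate, ahead = 7)

wf <- fit(epi_workflow(r, linear_reg()), jhu)

# Expected to fail: get_test_data() keeps roughly 8 days of history where
# roughly 15 are needed, eventually surfacing the nested stopifnot() error.
forecast(wf)
```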
Potential resolutions:
1. Modify `step_epi_shift()`, `step_lag_difference()`, and `step_growth_rate()` to do some additional tagging of outputs based on the shift range they depend on [given issue #362 ("(Gaps causing) missing test predictors gives confusing error"), maybe actually a shift set], check their inputs for such tags and consider them in that logic, and extract that info in `get_test_data()` (see the sketch after this list). And make sure to appropriately label `step_lag_difference()`'s operation as a lag, not a horizon (the current `horizon` naming seems like a hack to make lag_difference + lag work). Maybe also export some related utilities to let developers manipulate these tags if they want to create new steps.
2. Try to avoid `get_test_data()` altogether, e.g., with this sort of approach.
3. Hybrid approach: tag "original" variables with a shift range of `[0, 0]` [a shift set of `{0}`]. (That probably simplifies the first resolution's logic anyway.) In `get_test_data()`, if not all relevant variables have shift ranges, then assume a shift range of `(-Inf, Inf)` [or a special value representing this / a branch handling this case, if using the shift-set approach]. Bake the data, getting extra rows, but then filter those baked results to just the latest time value (or latest time_value with nonmissing predictors?? not sure) per epikey.

Relatedly, watch out for when `get_test_data()` + baking causes epikeys to drop out of the data set, especially if it causes all epikeys to drop out of the data set. (By an epikey dropping out, I mean something like `drop_na(baked_test_data, all_predictors())` yielding 0 rows with that epikey when that epikey was "present" in the "original" data... for some definitions of "present" and "original"....)
yielding 0 rows with that epikey when that epikey was "present" in the "original" data.... for some definitions of "present" and "original"....)The text was updated successfully, but these errors were encountered: