Multiple predictors #90

Open
24 of 34 tasks
lindeloev opened this issue Aug 10, 2020 · 14 comments
Labels: enhancement (New feature or request)
lindeloev commented Aug 10, 2020

Each segment should take an arbitrary number of linear predictors. As with the segmented package, the only requirement is that one continuous predictor (say, x) is the dimension of the change point. The change point is simply the value on x where the predictions of y change to a different regression model (parameter structure and/or values).
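The switching behavior can be sketched in plain R. All names and parameter values below are illustrative, not mcp internals:

```r
# Sketch: predictions of y switch to a different regression model
# at the change point on x (illustrative values only).
x  <- seq(0, 10, by = 0.1)
cp <- 5                        # change point on the x dimension
y_hat <- ifelse(
  x < cp,
  1.0 + 0.5 * x,               # segment 1: one parameter structure
  4.0 - 0.3 * (x - cp)         # segment 2: different structure/values
)
```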

So this API should work. It has the following features:

  • Support categorical predictors via dummy coding.
  • Support interactions.
  • Each segment can have different numbers of predictors.
model = list(
    y ~ 1 + x*group + z + sigma(1 + group),  # interactions and main effects and a covariate.
    ~ 0 + x + ar(2, z),  # only one slope
    ~ 1 + group  # a range of x where group is the only predictor
)

JAGS-wise the indicator functions would be the same but now we additionally pass design matrices (X1_, X2_, etc.) and use inprod() per segment. The model above would be something like:

    # Priors for individual parameters
    cp_1 ~ ... T(MINX, MAXX)
    cp_2 ~ ... T(cp_1, MAXX)
    int_1 ~ dnorm(0, 1^-2)
    int_3 ~ ...
    xGroupFemale_1 ~ dnorm(0, 1)  # check R naming convention
    z_1 ~ dunif(0, 100)
    x_2 ~ dnorm(4, 3^-2)
    xGroupFemale_3 ~ ...

    # Model and likelihood
    for(i_ in 1:length(x1_)) {
        y_[i_] = (x_[i_] > cp_0) * (int_1 + inprod(c(xGroupFemale_1, z_1), x1_[i_, ])) + 
                (x_[i_] > cp_1) * (0 + inprod(c(x_2), x2_[i_, ])) +
                (x_[i_] > cp_2) * (int_3 + inprod(c(xGroupFemale_3), x3_[i_, ]))

        response[i_] ~ dnorm(y_[i_], sigma_[i_])
    }

where xi_ is a model matrix that is built R-side and x_ is par_x along which change points are defined. Implementing this adds the following work points:
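The R-side design matrices can be built with model.matrix(), one per segment RHS. A minimal sketch (the data and object names are illustrative, not mcp's actual internals):

```r
# One design matrix per segment formula; dummy coding and
# interactions come for free from model.matrix(). Illustrative data.
set.seed(1)
df <- data.frame(
  x = runif(20),
  z = rnorm(20),
  group = factor(sample(c("Female", "Male"), 20, replace = TRUE),
                 levels = c("Female", "Male"))
)
X1 <- model.matrix(~ 1 + x * group + z, data = df)  # segment 1
X2 <- model.matrix(~ 0 + x,             data = df)  # segment 2
X3 <- model.matrix(~ 1 + group,         data = df)  # segment 3
```

colnames(X1) yields columns like "groupMale" and "x:groupMale", which hints at the R naming convention mentioned in the JAGS sketch above.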

Data structure

  • Rename intercept: Intercept_i instead of int_i.
  • Require par_x if there is not exactly one continuous predictor.
  • Decide parameter naming: stay close to lm and brms but add _segmentnumber.
  • Split out RHS parameters from the segment table to a new long-format structure.
  • Set appropriate priors. Distinguish between intercepts, dummies, and slopes?
  • Raise error on rank deficiency using base::qr().
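The rank-deficiency check can be as simple as comparing qr()$rank against the number of design-matrix columns. A sketch with a hypothetical helper name:

```r
# Raise an informative error when the design matrix is rank deficient
# (e.g., perfectly collinear predictors). Hypothetical helper.
assert_full_rank <- function(X) {
  if (qr(X)$rank < ncol(X)) {
    stop("Design matrix is rank deficient. Remove collinear predictors.")
  }
  invisible(TRUE)
}

X_bad <- cbind(`(Intercept)` = 1, x = 1:10, x2 = 2 * (1:10))  # x2 = 2x
# assert_full_rank(X_bad)  # would error: rank 2 < 3 columns
```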

Modeling, sampling, and summaries

  • Update get_formula() to match the new segment table.
  • Update get_jagscode()
  • Update run_jags() and get_jags_data() to work with design matrices. One per (dpar-segment) combo.
  • Make sure that it applies for sigma(1 + x * group) etc. (I think it will out of the box).
  • Verify that it works for plot_pars(), hypothesis(), summary(), fixef(), etc.
  • get_summary(): Translate between code parameter name and user-facing parameter name.

Simulated, fitted, and predicted values

  • Change args to fit$simulate(fit, data, par1, par2, ...), i.e., add fit and replace par_x with data.frame/tibble.
  • Check that the columns in data have the correct format. Set factor levels to match the original data.
  • Make fit$simulate() a wrapper around a lower-level fast function to use internally. Call it fit$.internal$simulate_vec(par_x, cp_1, ..., rhs_par1, rhs_par2, ...). Only the former should do asserts, call add_simulated(), etc.
  • Implement for ar()
  • Implement for varying change points.
  • Call fit$.internal$simulate_vec() from all internal functions instead of fit$simulate().
  • Ensure that it works for fitted(), predict(), pp_check(), etc.
  • Fix existing mcp_examples
  • Add new mcp_examples?
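Matching factor levels in new data against the original fit data could look like the following (hypothetical helper for illustration, not mcp's actual code):

```r
# Coerce factor columns in new data to the levels seen at fit time,
# erroring on unseen levels. Hypothetical helper.
match_levels <- function(newdata, original) {
  for (col in intersect(names(newdata), names(original))) {
    if (is.factor(original[[col]])) {
      newdata[[col]] <- factor(newdata[[col]],
                               levels = levels(original[[col]]))
      if (anyNA(newdata[[col]]))
        stop("Unknown factor level in column '", col, "'")
    }
  }
  newdata
}
```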

Plot

  • Allow faceting by any unique combination of (categorical) variables using plot(fit, facet_by = c("my_rhs", "my_varying_cp")). Still default to no facets.
  • Control spaghetti groupings and colors using color_by = c("my_categorical1", "my_categorical2"). It defaults to color_by = "all_categorical", i.e., all unique combinations of categorical levels on the RHS. This will also set the grouping for spaghettis. I think that color_by should pertain solely to RHS terms, which share change points; varying change points will not be accepted.
  • Add a way to include a subset of the predictors: plot(fit, effects = "my_categorical1"). It's like brms::marginal_effects(). This should probably be implemented in tidy_samples().
  • Display only a subset of the levels using plot(fit, filter = data.frame(my_categorical1 = c("levelA", "levelB"), my_categorical2 = "level1")). This is like brms::marginal_effects(), except that filter uses a data.frame where int_conditions is a named list. For variables in effects that are not in filter, all levels will be included. This should probably be implemented in tidy_samples().
  • (Add option to facet_by group levels in pp_check?)
  • Add pages to plot_pars()
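The filter = data.frame(...) semantics amount to keeping rows whose categorical values match the (recycled) filter grid, e.g. via a semi-join. A sketch under that assumption:

```r
# data.frame() recycles "level1", so this grid holds two combinations:
# (levelA, level1) and (levelB, level1).
filter_grid <- data.frame(
  my_categorical1 = c("levelA", "levelB"),
  my_categorical2 = "level1"
)
# Keep only matching rows of the plot data (dplyr assumed available):
# plot_data <- dplyr::semi_join(plot_data, filter_grid, by = names(filter_grid))
```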

Tests

  • Pass existing test suite.
  • Write tests for combinations of main effects, factor-factor interactions, and factor-continuous interactions on ct, sigma, and ar. Also test the number of expected model parameters with and without intercepts.
  • Expect errors on multiple terms inside base functions, e.g., ~ exp(1 + x).
  • Test the new plot functions and tidy_draws()
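A testthat sketch for the error-on-multiple-terms case (the sample argument and the exact error behavior are assumptions here):

```r
# Illustrative test: mcp should error when a base function wraps
# multiple terms, e.g. exp(1 + x). API details are assumptions.
testthat::test_that("multiple terms inside base functions raise an error", {
  model <- list(y ~ 1 + x, ~ exp(1 + x))
  testthat::expect_error(mcp::mcp(model, data = df, sample = FALSE))
})
```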
@lindeloev lindeloev added the enhancement New feature or request label Aug 10, 2020
@lindeloev lindeloev added this to the 0.5 Multiple predictors milestone Aug 10, 2020
@lindeloev lindeloev self-assigned this Aug 10, 2020
@sjmgarnier

@lindeloev Out of curiosity, do you have a timeline on when this will be implemented?

@lindeloev
Owner Author

@sjmgarnier I got this to work locally, but while it provides many more modeling options (and is easier to maintain/extend), sampling of currently supported models takes double the time. mcp is already something like 100x slower than other packages; it aims to be useful for modeling options rather than speed. So I'm in two minds about whether to continue down this path. What do you think?

@sjmgarnier

@lindeloev I'll answer very selfishly by saying that I need it for a project that I'm working on :-) I could not find a satisfying alternative to perform weighted segmented regressions with random effects. And I'm a patient man with a powerful computer, so I can wait for a few minutes that the fit finishes. But I'm also very well aware of how time-consuming it is to develop and maintain packages, so I would completely understand if you decided to focus on other priorities.

@lindeloev
Owner Author

Cool, mcp is henceforth aimed at patient users with powerful computers :-) Yes, for data with < 200 data points, we're talking a slowdown from ~10 secs to ~20 secs so it's not the end of the world. I think this would be awesome so it's my top priority for the next release (mcp 0.4). Though I think that it will likely be at least a few months before it is out.

The first version with this will likely not have random effects on the RHS. But using a categorical intercept (like group in the OP) should be reasonably close in many scenarios since shrinkage is often minor.

@lindeloev
Owner Author

I'm working on this now and have almost finished making all design decisions. Luckily, I found an implementation that won't negatively affect performance. Expect release in a few months, depending on how hard it is to make sensible plot() etc.

@mattmoo

mattmoo commented Jan 17, 2021

Is this implemented in a development version? A reviewer demands a multivariate analysis :/

@lindeloev
Owner Author

@mattmoo I just pushed the development version to branch v0.4 here. I think you can install it using remotes::install_github("lindeloev/mcp@v0.4"). You can fit models and see parameter summaries, including plot_pars(fit). It works for e.g. y ~ x + sigma(1 + x:group) too.

But fit$simulate(), plot(), fitted()/predict(), and pp_check() won't work yet, though I'm making good progress! So you may want to shift back and forth between versions if you need some of the old functionality for one-predictor-only models.

I've only tested it on gaussian models so far, so please triple-check. Any feedback and ideas would be much appreciated, BTW!

@lindeloev
Owner Author

There is good progress on this. I just pushed the latest version to branch v0.4. Install using remotes::install_github("lindeloev/mcp@v0.4"). The easiest way to see it in action is running result = mcp_example("multiple").

I'm basically just missing plot() support for varying effects and categorical predictors.

@mattmoo

mattmoo commented Feb 20, 2021

That's great! I'm waiting on compute resources to actually run the analysis (50,000 datapoints :/)

@lindeloev
Owner Author

I just tested the performance locally on a Ryzen 5 3600. For 55,000 data points and a 15-parameter model with categorical predictors, I get around 1 sample per second. So if you run in parallel with default settings (3000 warmup iters + 1500 data iters), you might complete sampling in a matter of hours?

ex = mcp_example("multiple")
df_tmp = tidyr::expand_grid(rep = 1:250, ex$data) %>%  # upscale to 55000 data points
  dplyr::select(-rep)
fit = mcp(ex$model, df_tmp, par_x = "x", chains = 1, adapt = 100, iter = 100)

I just pushed a new commit to v0.4 which requires ~10% of the memory of the previous version during sampling. Maybe that helped too.

@mattmoo

mattmoo commented Feb 21, 2021

I'll give it a go. From my previous experience with these data (see link), the sampling does not converge so quickly (and I do not have such a nice processor!).

@adrose

adrose commented Aug 3, 2023

Hi, is this issue still up to date or is there additional information for how to incorporate additional predictors?

@lindeloev
Owner Author

@adrose Unfortunately not. The v0.4 branch currently doesn't run out of the box due to some backwards-incompatible changes in dependency packages that I have only incorporated into the v0.3 series. I'm presently prioritizing another project higher, but I'm really looking forward to getting v0.4 out since it's awesome!

@gavanmcgrath

Love the package. Looking forward to v0.4.
