Rethinking degroup() for cross-classified data #637

@mattansb

Follow-up to @jmgirard's #520

I think the current implementation of cross-classified disaggregation is missing a desideratum.

First let's note some desiderata that we do have:

  1. The "between" variables are simply the separate group means.
# Make some crossed data
mu <- 100
ul <- setNames(c(-1, -3, 0, 4), nm = letters[1:4])
uL <- setNames(c(10, 30, 0, -40), nm = LETTERS[1:4])
um <- setNames(c(100, 150, -250), nm = month.abb[1:3])

dat <- expand.grid(l = letters[1:4], L = LETTERS[1:4], m = month.abb[1:3])

set.seed(111)
e <- rnorm(nrow(dat)-1) |> round(2)
e <- append(e, -sum(e))

dat$y <- mu + ul[dat$l] + uL[dat$L] + um[dat$m] + e
dat$z <- mu + ul[dat$l] + uL[dat$L] + um[dat$m] + 10*e
dat_dem <- datawizard::demean(dat, by = c("l", "L", "m"), select = c("y","z"))

all.equal(c(dat_dem$y_l_between), ave(dat$y, dat$l))
#> TRUE
all.equal(c(dat_dem$y_L_between), ave(dat$y, dat$L))
#> TRUE
all.equal(c(dat_dem$y_m_between), ave(dat$y, dat$m))
#> TRUE
  2. The sum of an observation's "between" and "within" variables is equal to the original observation.
all.equal(rowSums(dat_dem[grepl("^y_", colnames(dat_dem))]), dat$y)
#> TRUE

What we don't have is a mean-centered "within" variable, which we do get with a single grouping variable or with nested designs:

mean(dat_dem$y_within)
#> -200

This is equal to $-\bar{Y} \times (\text{number of grouping vars} - 1)$:

-mean(dat$y) * (3-1)
#> -200
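
This value seems to follow directly from the two desiderata. As a sketch, with notation introduced here for illustration only: let $K$ be the number of grouping variables, $B_{k,i}$ the $k$-th "between" variable and $W_i$ the "within" variable for observation $i$. Desideratum 2 gives the decomposition, and desideratum 1 implies that each "between" variable averages to the grand mean, since the observation-weighted mean of group means is the grand mean:

$$
Y_i = W_i + \sum_{k=1}^{K} B_{k,i}
\quad\Rightarrow\quad
\bar{Y} = \bar{W} + \sum_{k=1}^{K} \bar{B}_k
        = \bar{W} + K\bar{Y}
\quad\Rightarrow\quad
\bar{W} = -(K - 1)\,\bar{Y}
$$

With $K = 3$ and $\bar{Y} = 100$ this gives $\bar{W} = -200$, and with $K = 1$ it gives $\bar{W} = 0$, matching the single-grouping case.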

I think a mean-centered "within" variable is something we want, for consistency (typically "within" is considered to be automatically double-centered). However, with a crossed design this cannot be achieved without compromising on desideratum 1 or 2.
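
To make the trade-off concrete, here is a minimal base R sketch. It does not call datawizard and is not a proposal for how `demean()` should behave. It uses simplified simulated data (noise around 100 rather than the data above), and the names `within_a`, `between_b` and `within_b` are hypothetical. Option A keeps the raw group means but breaks the sum, while option B keeps the sum but shifts the "between" variables away from the raw group means.

```r
# Crossed design with the same three grouping variables as above
dat <- expand.grid(l = letters[1:4], L = LETTERS[1:4], m = month.abb[1:3])
set.seed(111)
dat$y <- 100 + rnorm(nrow(dat))

by <- c("l", "L", "m")
k <- length(by)
grand <- mean(dat$y)

# Desideratum 1: each "between" variable is the raw group mean
between <- sapply(by, function(g) ave(dat$y, dat[[g]]))

# Desideratum 2: "within" is whatever is left, so the parts sum to y
within <- dat$y - rowSums(between)
all.equal(mean(within), -(k - 1) * grand)

# Option A (hypothetical): keep the raw group means and center "within".
# The parts now sum to y + (k - 1) * grand, so desideratum 2 is lost.
within_a <- within + (k - 1) * grand
isTRUE(all.equal(within_a + rowSums(between), dat$y))

# Option B (hypothetical): shift each "between" so their means sum to grand.
# The parts still sum to y and "within" is centered, but the "between"
# variables are no longer the raw group means, so desideratum 1 is lost.
between_b <- between - (k - 1) / k * grand
within_b <- dat$y - rowSums(between_b)
all.equal(within_b + rowSums(between_b), dat$y)
isTRUE(all.equal(between_b[, "l"], ave(dat$y, dat$l)))
```

Both options yield the same centered "within" column. They differ only in where the offset of $(K - 1)\bar{Y}$ ends up.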
