Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected behavior in .BY #5389

Open
cstepper opened this issue May 23, 2022 · 5 comments · May be fixed by #6193
Open

Unexpected behavior in .BY #5389

cstepper opened this issue May 23, 2022 · 5 comments · May be fixed by #6193
Assignees

Comments

@cstepper
Copy link

Hi,

we are trying to extract the .BY information to a list object outside the current data.table. Unfortunately, we get unexpected results.
All elements of the list of length(by groups) contain the same value - the .BY information from the last group.
When using .GRP instead, the list is populated with different values - the correct group indices.

Is this expected behavior, and how can I get a list of correct by-group values be extracted?

Thanks!

library(data.table)

dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

byg = list()
grp = list()
dt[
  , m := {
    byg <<- append(byg, .BY)
    grp <<- append(grp, .GRP)
    mean(b)
  }
  , by = l
]

str(byg)
#> List of 5
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
str(grp)
#> List of 5
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
#>  $ : int 4
#>  $ : int 5

sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13   knitr_1.39        magrittr_2.0.3    R.cache_0.15.0   
#>  [5] rlang_1.0.2       fastmap_1.1.0     fansi_1.0.3       stringr_1.4.0    
#>  [9] styler_1.7.0      highr_0.9         tools_4.2.0       xfun_0.31        
#> [13] R.oo_1.24.0       utf8_1.2.2        cli_3.3.0         withr_2.5.0      
#> [17] htmltools_0.5.2   ellipsis_0.3.2    yaml_2.3.5        digest_0.6.29    
#> [21] tibble_3.1.7      lifecycle_1.0.1   crayon_1.5.1      purrr_0.3.4      
#> [25] R.utils_2.11.0    vctrs_0.4.1       fs_1.5.2          glue_1.6.2       
#> [29] evaluate_0.15     rmarkdown_2.14    reprex_2.0.1      stringi_1.7.6    
#> [33] compiler_4.2.0    pillar_1.7.0      R.methodsS3_1.8.1 pkgconfig_2.0.3

Created on 2022-05-23 by the reprex package (v2.0.1)

@ben-schwen
Copy link
Member

Seems more to be an issue of the collecting procedure, since printing the values works as expected.

library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

dt[
  , m := {
    cat("BY:", as.character(.BY), "\n")
    cat("GRP:", as.character(.GRP), "\n")
    mean(b)
  }
  , by = l
]
#> BY: a 
#> GRP: 1 
#> BY: b 
#> GRP: 2 
#> BY: c 
#> GRP: 3 
#> BY: d 
#> GRP: 4 
#> BY: e 
#> GRP: 5

@ben-schwen
Copy link
Member

It seems that Rs lazy evaluation comes here into play. Forcing evaluation counters this.

library(data.table)

dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

byg = list()
grp = list()
dt[
  , m := {
    byg <<- append(byg, force(unlist(.BY)))
    grp <<- append(grp, .GRP)
    mean(b)
  }
  , by = l
]
str(byg)
#> List of 5
#>  $ l: chr "a"
#>  $ l: chr "b"
#>  $ l: chr "c"
#>  $ l: chr "d"
#>  $ l: chr "e"
str(grp)
#> List of 5
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
#>  $ : int 4
#>  $ : int 5

@cstepper
Copy link
Author

Thanks for your comments @ben-schwen .

When using unlist(), forcing is not required.
However, I haven't found a solution for getting back .BY without modifying it before (I do not want to unlist due to coercion).
My feeling is, that when sending the .BY to a function using some src code (e.g. unlist, purrr::flatten), it works.

Is this expected? Any ideas for just returning the .BY lists?

library(data.table)

dt = data.table(
  l = sample(letters[1:2], 100, replace = TRUE)
  , n = sample(1L:2L, 100, replace = TRUE)
  , b = rnorm(100)
)

keys = c("l", "n")
setkeyv(dt, keys)

by_list = list()
by_unlist = list()
by_flatten = list()
dt[
  , m := {
    by_list <<- append(by_list, list(.BY))
    by_unlist <<- append(by_unlist, list(unlist(.BY)))
    by_flatten <<- append(by_flatten, list(purrr::flatten(.BY)))
    mean(b)
  }
  , by = keys
]

str(by_list)
#> List of 4
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
str(by_unlist)
#> List of 4
#>  $ : Named chr [1:2] "a" "1"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "a" "2"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "b" "1"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "b" "2"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
str(by_flatten)
#> List of 4
#>  $ :List of 2
#>   ..$ l: chr "a"
#>   ..$ n: int 1
#>  $ :List of 2
#>   ..$ l: chr "a"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 1
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2

Created on 2022-05-27 by the reprex package (v2.0.1)

@cstepper
Copy link
Author

Please let me know, if you think this should better discussed on Stackoverflow. Thanks

@ben-schwen
Copy link
Member

ben-schwen commented May 27, 2022

If you take a look at the source of dogropus

data.table/src/dogroups.c

Lines 91 to 104 in e9a323d

for (int i=0; i<ngrpcols; ++i) {
int j = INTEGER(grpcols)[i]-1;
SET_VECTOR_ELT(BY, i, allocVector(TYPEOF(VECTOR_ELT(groups, j)),
nrowgroups ? 1 : 0)); // TODO: might be able to be 1 always but 0 when 'groups' are integer(0) seem sensible. #2440 was involved in the past.
// Fix for #36, by cols with attributes when also used in `j` lost the attribute.
copyMostAttrib(VECTOR_ELT(groups, j), VECTOR_ELT(BY,i)); // not names, otherwise test 778 would fail
SET_STRING_ELT(bynames, i, STRING_ELT(getAttrib(groups,R_NamesSymbol), j));
defineVar(install(CHAR(STRING_ELT(bynames,i))), VECTOR_ELT(BY,i), env); // by vars can be used by name in j as well as via .BY
if (SIZEOF(VECTOR_ELT(BY,i))==0)
error(_("Internal error: unsupported size-0 type '%s' in column %d of 'by' should have been caught earlier"), type2char(TYPEOF(VECTOR_ELT(BY, i))), i+1); // # nocov
SET_TRUELENGTH(VECTOR_ELT(BY,i), -1); // marker for anySpecialStatic(); see its comments
}
setAttrib(BY, R_NamesSymbol, bynames); // Fix for #42 - BY doesn't retain names anymore
R_LockBinding(sym_BY, env);

you can see that BY is always directly assigned and overwritten for each group, so with assigning it to a list, you always assign the same memory pointer.

This is also confirmed by the following.

library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

by_list = list()
# you should preallocate in R
# by_list = vector(mode="list", length(unique(dt$l)))

dt[
  , m := {
    by_list <<- append(by_list, copy(.BY))
    cat(address(.BY), "\n")
    mean(b)
  }
  , by = l
]
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930
by_list
#> $l
#> [1] "a"
#> 
#> $l
#> [1] "b"
#> 
#> $l
#> [1] "c"
#> 
#> $l
#> [1] "d"
#> 
#> $l
#> [1] "e"

You can achieve your wanted behavior with using copy(.BY). Similar info is in the help of ?.BY but it only mentions this behavior for .SD.

The working assignment without copying probably depends on whether R is internally copying or not. In previous R versions, it would always copy but R-core improved this a lot in the last years.

Anyway, we should clarify this in the documentation. So I guess here is the right place to discuss it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants