Unexpected behavior in `.BY` #5389

cstepper · 2022-05-23T08:27:10Z

Hi,

we are trying to extract the .BY information to a list object outside the current data.table. Unfortunately, we get unexpected results.
All elements of the list of length(by groups) contain the same value - the .BY information from the last group.
When using .GRP instead, the list is populated with different values - the correct group indices.

Is this expected behavior, and how can I get a list of correct by-group values be extracted?

Thanks!

library(data.table)

dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

byg = list()
grp = list()
dt[
  , m := {
    byg <<- append(byg, .BY)
    grp <<- append(grp, .GRP)
    mean(b)
  }
  , by = l
]

str(byg)
#> List of 5
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
#>  $ l: chr "e"
str(grp)
#> List of 5
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
#>  $ : int 4
#>  $ : int 5

sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.2
#> 
#> loaded via a namespace (and not attached):
#>  [1] rstudioapi_0.13   knitr_1.39        magrittr_2.0.3    R.cache_0.15.0   
#>  [5] rlang_1.0.2       fastmap_1.1.0     fansi_1.0.3       stringr_1.4.0    
#>  [9] styler_1.7.0      highr_0.9         tools_4.2.0       xfun_0.31        
#> [13] R.oo_1.24.0       utf8_1.2.2        cli_3.3.0         withr_2.5.0      
#> [17] htmltools_0.5.2   ellipsis_0.3.2    yaml_2.3.5        digest_0.6.29    
#> [21] tibble_3.1.7      lifecycle_1.0.1   crayon_1.5.1      purrr_0.3.4      
#> [25] R.utils_2.11.0    vctrs_0.4.1       fs_1.5.2          glue_1.6.2       
#> [29] evaluate_0.15     rmarkdown_2.14    reprex_2.0.1      stringi_1.7.6    
#> [33] compiler_4.2.0    pillar_1.7.0      R.methodsS3_1.8.1 pkgconfig_2.0.3

^{Created on 2022-05-23 by the reprex package (v2.0.1)}

The text was updated successfully, but these errors were encountered:

ben-schwen · 2022-05-23T10:17:54Z

Seems more to be an issue of the collecting procedure, since printing the values works as expected.

library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

dt[
  , m := {
    cat("BY:", as.character(.BY), "\n")
    cat("GRP:", as.character(.GRP), "\n")
    mean(b)
  }
  , by = l
]
#> BY: a 
#> GRP: 1 
#> BY: b 
#> GRP: 2 
#> BY: c 
#> GRP: 3 
#> BY: d 
#> GRP: 4 
#> BY: e 
#> GRP: 5

ben-schwen · 2022-05-23T10:32:06Z

It seems that Rs lazy evaluation comes here into play. Forcing evaluation counters this.

library(data.table)

dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

byg = list()
grp = list()
dt[
  , m := {
    byg <<- append(byg, force(unlist(.BY)))
    grp <<- append(grp, .GRP)
    mean(b)
  }
  , by = l
]
str(byg)
#> List of 5
#>  $ l: chr "a"
#>  $ l: chr "b"
#>  $ l: chr "c"
#>  $ l: chr "d"
#>  $ l: chr "e"
str(grp)
#> List of 5
#>  $ : int 1
#>  $ : int 2
#>  $ : int 3
#>  $ : int 4
#>  $ : int 5

cstepper · 2022-05-27T10:03:20Z

Thanks for your comments @ben-schwen .

When using unlist(), forcing is not required.
However, I haven't found a solution for getting back .BY without modifying it before (I do not want to unlist due to coercion).
My feeling is, that when sending the .BY to a function using some src code (e.g. unlist, purrr::flatten), it works.

Is this expected? Any ideas for just returning the .BY lists?

library(data.table)

dt = data.table(
  l = sample(letters[1:2], 100, replace = TRUE)
  , n = sample(1L:2L, 100, replace = TRUE)
  , b = rnorm(100)
)

keys = c("l", "n")
setkeyv(dt, keys)

by_list = list()
by_unlist = list()
by_flatten = list()
dt[
  , m := {
    by_list <<- append(by_list, list(.BY))
    by_unlist <<- append(by_unlist, list(unlist(.BY)))
    by_flatten <<- append(by_flatten, list(purrr::flatten(.BY)))
    mean(b)
  }
  , by = keys
]

str(by_list)
#> List of 4
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2
str(by_unlist)
#> List of 4
#>  $ : Named chr [1:2] "a" "1"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "a" "2"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "b" "1"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
#>  $ : Named chr [1:2] "b" "2"
#>   ..- attr(*, "names")= chr [1:2] "l" "n"
str(by_flatten)
#> List of 4
#>  $ :List of 2
#>   ..$ l: chr "a"
#>   ..$ n: int 1
#>  $ :List of 2
#>   ..$ l: chr "a"
#>   ..$ n: int 2
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 1
#>  $ :List of 2
#>   ..$ l: chr "b"
#>   ..$ n: int 2

^{Created on 2022-05-27 by the reprex package (v2.0.1)}

cstepper · 2022-05-27T13:22:58Z

Please let me know, if you think this should better discussed on Stackoverflow. Thanks

ben-schwen · 2022-05-27T16:33:18Z

If you take a look at the source of dogropus

data.table/src/dogroups.c

Lines 91 to 104 in e9a323d

    
           for (int i=0; i<ngrpcols; ++i) { 
        
             int j = INTEGER(grpcols)[i]-1; 
        
             SET_VECTOR_ELT(BY, i, allocVector(TYPEOF(VECTOR_ELT(groups, j)), 
        
               nrowgroups ? 1 : 0)); // TODO: might be able to be 1 always but 0 when 'groups' are integer(0) seem sensible. #2440 was involved in the past. 
        
             // Fix for #36, by cols with attributes when also used in `j` lost the attribute. 
        
             copyMostAttrib(VECTOR_ELT(groups, j), VECTOR_ELT(BY,i));  // not names, otherwise test 778 would fail 
        
             SET_STRING_ELT(bynames, i, STRING_ELT(getAttrib(groups,R_NamesSymbol), j)); 
        
             defineVar(install(CHAR(STRING_ELT(bynames,i))), VECTOR_ELT(BY,i), env);      // by vars can be used by name in j as well as via .BY 
        
             if (SIZEOF(VECTOR_ELT(BY,i))==0) 
        
               error(_("Internal error: unsupported size-0 type '%s' in column %d of 'by' should have been caught earlier"), type2char(TYPEOF(VECTOR_ELT(BY, i))), i+1); // # nocov 
        
             SET_TRUELENGTH(VECTOR_ELT(BY,i), -1); // marker for anySpecialStatic(); see its comments 
        
           } 
        
           setAttrib(BY, R_NamesSymbol, bynames); // Fix for #42 - BY doesn't retain names anymore 
        
           R_LockBinding(sym_BY, env);

you can see that BY is always directly assigned and overwritten for each group, so with assigning it to a list, you always assign the same memory pointer.

This is also confirmed by the following.

library(data.table)
dt = data.table(l = sample(letters[1:5], 100, replace = TRUE), b = rnorm(100))
setkey(dt, l)

by_list = list()
# you should preallocate in R
# by_list = vector(mode="list", length(unique(dt$l)))

dt[
  , m := {
    by_list <<- append(by_list, copy(.BY))
    cat(address(.BY), "\n")
    mean(b)
  }
  , by = l
]
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930 
#> 0x56343a667930
by_list
#> $l
#> [1] "a"
#> 
#> $l
#> [1] "b"
#> 
#> $l
#> [1] "c"
#> 
#> $l
#> [1] "d"
#> 
#> $l
#> [1] "e"

You can achieve your wanted behavior with using copy(.BY). Similar info is in the help of ?.BY but it only mentions this behavior for .SD.

The working assignment without copying probably depends on whether R is internally copying or not. In previous R versions, it would always copy but R-core improved this a lot in the last years.

Anyway, we should clarify this in the documentation. So I guess here is the right place to discuss it.

ben-schwen added the documentation label May 27, 2022

This was referenced Jun 19, 2024

Documentation of by about its order #4281

Closed

Clarify Usage of .BY in Special Symbols Documentation #6193

Open

GSOC'24 ENHANCING DATA.TABLE PROJECT DISCUSSION THREAD Nj221102/data.table-gSoc-24-project#1

Open

Nj221102 self-assigned this Jun 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected behavior in `.BY` #5389

Unexpected behavior in `.BY` #5389

cstepper commented May 23, 2022

ben-schwen commented May 23, 2022

ben-schwen commented May 23, 2022

cstepper commented May 27, 2022

cstepper commented May 27, 2022

ben-schwen commented May 27, 2022 •

edited

Loading

Unexpected behavior in .BY #5389

Unexpected behavior in .BY #5389

Comments

cstepper commented May 23, 2022

ben-schwen commented May 23, 2022

ben-schwen commented May 23, 2022

cstepper commented May 27, 2022

cstepper commented May 27, 2022

ben-schwen commented May 27, 2022 • edited Loading

Unexpected behavior in `.BY` #5389

Unexpected behavior in `.BY` #5389

ben-schwen commented May 27, 2022 •

edited

Loading