Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set() that does not update any rows should still add cols #5409

Closed
mb706 opened this issue Jun 21, 2022 · 2 comments · Fixed by #6204
Closed

set() that does not update any rows should still add cols #5409

mb706 opened this issue Jun 21, 2022 · 2 comments · Fixed by #6204
Milestone

Comments

@mb706
Copy link

mb706 commented Jun 21, 2022

I would expect set(x, i, j, val) to work similarly to x[i, j := val] or x[i, j] <- val for integer i. However, there is a discrepancy when i is 0L: While <- and := add missing columns, set() does not.

x <- data.table(a = 1)
x[0L, "b"] <- character(0)
x
#>        a      b
#>    <num> <char>
#> 1:     1   <NA>

x <- data.table(a = 1)
x[0L, b := character(0)]
x
#>        a      b
#>    <num> <char>
#> 1:     1   <NA>

vs.

x <- data.table(a = 1)
set(x, 0L, "b", character(0))
x
#>        a
#>    <num>
#> 1:     1

sessionInfo()

R version 4.1.3 (2022-03-10)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Fedora Linux 35 (Thirty Five)

Matrix products: default
BLAS/LAPACK: /usr/lib64/libflexiblas.so.3.2

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.3

loaded via a namespace (and not attached):
[1] compiler_4.1.3 tools_4.1.3   
@jangorecki jangorecki added this to the 1.14.5 milestone Sep 21, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@ben-schwen
Copy link
Member

ben-schwen commented Jan 7, 2024

set of tests

dt = data.table(a=1L)
test(1, set(copy(dt), 0L,          "b", logical(0)), data.table(a=1L, b=NA))
test(1, set(copy(dt), NA_integer_, "b", NA),         data.table(a=1L, b=NA))
test(1, set(copy(dt), NA_integer_, "b", logical(0)), copy(dt)[0L, b := logical(0)])
test(1, set(copy(dt), 0L,          "b", NA),         copy(dt)[NA, b := NA])

Maybe we should rewrite SEXP assign(SEXP dt, SEXP rows, SEXP cols, SEXP newcolnames, SEXP values) in the way to remove newcolnames argument since everything can now be done by cols?

@joshhwuu
Copy link
Member

joshhwuu commented Jun 23, 2024

I've been trying to figure this one out. The 0L case is easy enough to add by changing the condition here to include 0:

data.table/src/assign.c

Lines 382 to 384 in 4e6ad97

if ((rowsd[i]<0 && rowsd[i]!=NA_INTEGER) || rowsd[i]>nrow)
error(_("i[%d] is %d which is out of range [1,nrow=%d]"), i+1, rowsd[i], nrow); // set() reaches here (test 2005.2); := reaches the same error in subset.c first
if (rowsd[i]>=1) numToDo++;

But for the NA_integer_ case, things are a little different.

AFAIU, when i is NA_integer_, it isn't a valid entry and is disregarded (numToDo is not incremented). If no valid rows are selected and no new columns are being created, we return the dt early, which makes sense:

data.table/src/assign.c

Lines 388 to 392 in 4e6ad97

if (numToDo==0) {
if (!length(newcolnames)) {
*_Last_updated = 0;
UNPROTECT(protecti);
return(dt); // all items of rows either 0 or NA. !length(newcolnames) for #759

This causes the behavior above as when we pass in NA or 0L, numToDo = 0, and since we are calling from set(), newcolnames = NULL always, causing the operation to return early without creation of the new column.

I figured that this was able to be solved by checking whether we call from set() or from within :=. Initially I wasn't able to get a way to do this, but I noticed the following:

data.table/src/assign.c

Lines 404 to 413 in 4e6ad97

// FR #2077 - set able to add new cols by reference
if (isString(cols)) {
PROTECT(tmp = chmatch(cols, names, 0)); protecti++;
buf = (int *) R_alloc(length(cols), sizeof(int));
int k=0;
for (int i=0; i<length(cols); ++i) {
if (INTEGER(tmp)[i] == 0) buf[k++] = i;
}
if (k>0) {
if (!isDataTable) error(_("set() on a data.frame is for changing existing columns, not adding new ones. Please use a data.table for that. data.table's are over-allocated and don't shallow copy."));

:= never reaches this block, which I found strange as I didn't know that SEXP cols from set() and from := are different types. I thought this was heavily implied as this code block included set() specific warnings.

I was able to pass all tests (listed above and old ones) by adding an additional check for isString(cols):

if (!length(newcolnames) && !isString(cols)) { // return early if no new columns to be added and called from ':='
    *_Last_updated = 0;
    UNPROTECT(protecti);
    return(dt); // all items of rows either 0 or NA. !length(newcolnames) for #759
  }

So what I'm wondering is, does set() always pass SEXP cols as a string while := does not? If so, is this check reliable enough to differentiate from := and set()?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants