Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

set() adds new cols when rows aren't updated #6204

Merged
merged 12 commits into from
Jul 29, 2024
Merged

Conversation

joshhwuu
Copy link
Member

Closes #5409

Brings set() behavior closer to :=, so that set() doesn't return early even if no rows are updated.

Question here: I noticed that isString() is used inside SEXP assign to differentiate between calls from set() and from :=, therefore this change worked. Although I just want to confirm that this is intended, as it isn't immediately obvious to me why cols from set() is always passed as a string while not from :=.

Copy link

github-actions bot commented Jun 25, 2024

Comparison Plot

Generated via commit a04231d

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 11 minutes and 34 seconds

Time taken to run atime::atime_pkg on the tests: 5 minutes and 51 seconds

inst/tests/tests.Rraw Outdated Show resolved Hide resolved
src/assign.c Outdated Show resolved Hide resolved
@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 12, 2024

oops tried to resolve merge conflict on tests.rraw and screwed up the diff.. will try to fix now

git is hard

Copy link
Member

@tdhock tdhock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great thanks!
Is this behavior documented? If not please update docs.
Please add a NEWS item.

@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

related docs I found are ?set

set is a low-overhead loop-able version of ‘:=’
...
       i: Optional. Indicates the rows on which the values must be
          updated with. If not provided, implies _all rows_. The ‘:=’
          form is more powerful as it allows _subsets_ and ‘joins’
          based add/update columns by reference. See ‘Details’.

          In set, only integer type is allowed in ‘i’ indicating
          which rows ‘value’ should be assigned to. ‘NULL’ represents
          all rows more efficiently than creating a vector such as
          ‘1:nrow(x)’.

to me these docs imply that set only accepts integer i, but should be consistent with :=. Docs say that i should be integer indicating rows (so 1 to nrow) but they do not specify what happens when i values are outside of 1 to nrow.

I also find "docs" about what i is allowed to be in the error message below

> DT=data.table(x=1)
> set(DT, i=-1L, j="b", value=5)
Error in set(DT, i = -1L, j = "b", value = 5) : 
  i[1] is -1 which is out of range [1,nrow=1]
> set(DT, i=0L, j="b", value=5)
> set(DT, i=NA_integer_, j="b", value=5)
> DT
       x
   <num>
1:     1

The error message above implies that the only valid values for i are [1,nrow] (but missing/NA and 0 in i are not mentioned on ?set).

It seems to me that allowing 0L and NA_integer_ are inconsistent with this error message.

Also below we see that := allows NA_initeger_ and -1L, creating a new column with missing values, as when i=0L:

> DT[-1L, b := 5]
> DT[NA_integer_, c := 5]
> DT[2L, d := 5]
Error in `[.data.table`(DT, 2L, `:=`(c, 5)) : 
  i[1] is 2 which is out of range [1,nrow=1]
> DT
       x     b     c
   <num> <num> <num>
1:     1    NA    NA

So to summarise, we have in current master

i := set
-1L add new col error
0L add new col ignore
NA_integer_ add new col ignore
>nrow error error

For consistency and backward compatibility, we should probably change this to

i := set
-1L add new col add new col
0L add new col add new col
NA_integer_ add new col add new col
>nrow error error

and change the error message to i[1] is 2 but should not be larger than nrow=1

or just error everywhere (but that would be a breaking change, maybe there will be revdep issues)

@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

The only test case I see which specifies the behavior for weird i is

test(2005.02, set(DT, 4L, "b", NA), error="i[1] is 4 which is out of range [1,nrow=3]")

So i=-1, 0 and NA_integer_ are unspecified behavior in tests as well as docs.

@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

wdyt we should do about this @joshhwuu @MichaelChirico @mb706 @ben-schwen ?

@mb706
Copy link

mb706 commented Jul 18, 2024

I'd vote for your proposal making set consistent with :=. The current behaviour is what I stumbled over when replacing an instance of := with set(), expecting them to behave the same as long as i is integer-valued.

Maybe another testcase to consider would be

set(DT, i=integer(0), j="d", value=numeric(0))

This was also ignored previously and adds a new column as of this PR, which I also think is the right way.

@tdhock
Copy link
Member

tdhock commented Jul 18, 2024

you wrote

set(DT, i=integer(0), j="d", value=numeric(0))

but it is simpler/preferable to write
set(DT, i=0L, j="d", value=0)
literal numbers with L are integer, and without are numeric.
sorry my mistake I thought you wrote as.integer, I am tired today.

@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 18, 2024

I also prefer that we bring set behavior closer to :=, I believe this PR covers those cases mentioned above

@ben-schwen
Copy link
Member

ben-schwen commented Jul 18, 2024

@tdhock I think @mb706 actually meant zero length integer and numeric vectors as he wrote it.

NEWS.md Outdated Show resolved Hide resolved
@ben-schwen
Copy link
Member

ben-schwen commented Jul 18, 2024

Apart from news, LGTM

Might want to add mb706's set(copy(dt), i=integer(0), j="b", value=numeric(0) testcase.

@joshhwuu
Copy link
Member Author

joshhwuu commented Jul 22, 2024

@tdhock WDYT about changing the error to i[1] is 2 but should not be larger than nrow=1 as above, and mentioning in docs that NA, 0L and -1L given in i are not used and ignored

@MichaelChirico MichaelChirico added this to the 1.16.0 milestone Jul 28, 2024
@tdhock
Copy link
Member

tdhock commented Jul 29, 2024

mentioning in docs that NA, 0L and -1L given in i are not used and ignored

yes please add docs: missing values and non-positive values are ignored, positive values greater than nrow cause error.

@joshhwuu
Copy link
Member Author

@tdhock upon further investigation, it seems that negative indexing values allowed in DT[] is for non-selection of rows, for example:

DT = data.table(x = 1:5)
DT[-1L, b := 1]
DT
       x      b
   <int> <char>
1:     1   <NA>
2:     2      1
3:     3      1
4:     4      1
5:     5      1

Which I don't believe set() supports as is, and it would be better to use := in this situation. See ?set:

i: Optional. Indicates the rows on which the values must be
          updated with. If not provided, implies _all rows_. 
          The ‘:=’ form is more powerful as it allows _subsets_
          and ‘joins’ based add/update columns by reference. See
          ‘Details’.

I believe the correct way to go about this is to just mention missing values and 0L being ignored in the docs. LMK if you think otherwise.

@tdhock
Copy link
Member

tdhock commented Jul 29, 2024

yes that is moving in the right direction.
the more we dig the weirder it gets haha.

> DT=data.table(x=1:2)
> set(DT,-1,"x",0)
Error in set(DT, -1, "x", 0) : 
  i[1] is -1 which is out of range [1,nrow=2]
In addition: Warning message:
In set(DT, -1, "x", 0) :
  Coerced i from numeric to integer. Please pass integer for efficiency; e.g., 2L rather than 2
> DT[NA, x := 0]
> DT
       x
   <int>
1:     1
2:     2
> DT[c(-1,1), x := 0]
Error in `[.data.table`(DT, c(-1, 1), `:=`(x, 0)) : 
  Item 1 of i is -1 and item 2 is 1. Cannot mix positives and negatives.
> DT[c(-1,NA), x := 0]
Error in `[.data.table`(DT, c(-1, NA), `:=`(x, 0)) : 
  Item 1 of i is -1 and item 2 is NA. Cannot mix negatives and NA.
> DT[c(NA,1), x := 0]
> DT
       x
   <int>
1:     0
2:     2

do you think any of this should be documented? I guess we can leave it un-documented for now.

@joshhwuu
Copy link
Member Author

It would be difficult/time-consuming finding each of these cases and writing good concise documentation for them, so I agree that we can leave the documentation as-is for these niche cases

man/assign.Rd Outdated Show resolved Hide resolved
Copy link
Member

@MichaelChirico MichaelChirico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@MichaelChirico MichaelChirico merged commit 0ed5502 into master Jul 29, 2024
5 checks passed
@MichaelChirico MichaelChirico deleted the setconsistency branch July 29, 2024 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

set() that does not update any rows should still add cols
6 participants