-
Notifications
You must be signed in to change notification settings - Fork 968
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consistent Replacement of List Column with NULL #6167
base: master
Are you sure you want to change the base?
Conversation
Generated via commit ac8ce38 Download link for the artifact containing the test results: ↓ atime-results.zip Time taken to finish the standard R installation steps: 12 minutes and 20 seconds Time taken to run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes and tests look good, can you please add a NEWS item.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the NEWS entry and the changes look good to me as well!
As for the tests they are good too and work (just tested) but I think it might be neat to comment or separate them out a bit for one to quickly see what each test is doing and how is it different from the other. For e.g., for your tests from top to bottom in order, it could be a comment that conveys that you replaced a list column with standard assignment to NULL, did the same but using the := syntax or modified in-place, compared with another data.table
, replaced multiple elements with NULL and then followed up with tests similar to the single element replacement case.
I know I'm quite late to the party, but in my opinion, ideally, we would bring the |
Hm.. not sure I understand what you mean here, I thought the simplest fix would be in |
this may be a breaking change (revdep checks could fail as a result) |
Currently, there are multiple ways to alter/add new columns to a |
Oh I see. Do you propose that we try to include the changes in this PR, or is it worth filing a separate issue? |
There are already multiple issues about the divergence of set and :=. It does not have to be this PR, I just thought that this might be an interesting topic to work on in GSOC (maybe as stacked PR) |
@tdhock Reference Semantics Vignette
This vignette implies that the result of the two forms exist, with the primary difference being syntax and the functional form being more chatty. IMO, it implies(?) that the two are the same. Assignment by reference doc
I think this documentation also implies that the different usages both work. It does state that let and functional form are equivalent. So, I will add some tests in this PR to check that using Although these two documentations imply that the results of either form are largely the same, I haven't found anywhere in the documentation that says it is always guaranteed to be the same. While searching this up on google, I found this stack overflow thread talking about different results when using functional form and assigning by reference: https://stackoverflow.com/questions/44067091/different-results-for-standard-form-and-functional-form-of-data-table-assigne-by Jan explained here that there are slight differences in how RHS is handled causing a difference in output between the two forms depending on whether the data we are assigning is a vector or a list: dt <- data.table(a = c('a','b','c'))
l <- list(v)
print(copy(dt)[, new := l])
print(copy(dt)[, `:=` (new = l)])
a new
<char> <char>
1: a A
2: b B
3: c C
a new
<char> <list>
1: a A,B,C
2: b A,B,C
3: c A,B,C This is still true as of current master (just tested), so I believe we shouldn't explicitly state that the results will be the exact same. But we should note that in most cases, the two forms are the same, which I believe the current documentation implies. |
This was my interpretation when I commented on the issue! |
would be good to clarify the docs, explicitly write they they should be the same, and when they are expected to be different |
TBH @tdhock I'm still a little confused on the exact differences between standard and functional form of assigning by reference. I want to ask for some of Jan's (and others) input to help me understand it. Plus it'll keep the logs on this PR a little clearer, as this PR didn't intend to fix documentation but is only slightly related, WDYT about filing a separate issue for that? Otherwise, if you think the vignette update is clear enough then we could keep it in this PR, however I'm having trouble reasoning why exactly the above behavior happens. My line of thinking at the moment is that because dt[, `:=`(new = list(1:3))] it is essentially equivalent to: dt[, new := list(new = list(1:3))] Since this is true (just tried), I wonder how the wrapping of RHS by list in standard form vs not wrapping in functional form is relevant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty close to what I had in mind, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
list of lists is too specific, just say list
@@ -15369,7 +15369,7 @@ L = list(1:3, NULL, 4:6) | |||
test(2058.18, length(L), 3L) | |||
test(2058.19, as.data.table(L), data.table(V1=1:3, V2=4:6)) # V2 not V3 # no | |||
DT = data.table(a=1:3, b=c(4,5,6)) | |||
test(2058.20, DT[,b:=list(NULL)], data.table(a=1:3)) # no | |||
test(2058.20, DT[,b:=list(NULL)], data.table(a=1:3, b=list(NULL))) # no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is a bit surprising to me. I suspect this will cause revdep breakage. Here are some examples:
https://github.com/search?q=lang%3AR%20%2F%3A%3D%5Cs*list%5C(%5Cs*NULL%5Cs*%5C)%2F&type=code
It would help to do a before/after on this PR of various ways to add columns, e.g. combinations of
- adding 1 column vs. multiple columns
- deleting 1 column vs. multiple columns
- adding & deleting columns in the same query
- operations on 1-row vs multi-row tables
- operations on list- vs non-list columns
Your examples in the main PR body are good but only cover a small part of the above. It may be that we need to cause some breakage for consistency, but we need to understand completely what's changing, what the recommended alternative is, why we can't fix the issue back-compatibly, etc.
e.g. for this test, IINM there's already a recommended way to add a new list column as a plonk (full-column replacement): b := .(list(NULL))
. Is it not possible (or just strongly ill-advised?) to allow both codepaths to continue to work:
b := list(NULL) # same as b := NULL, column deletion b/c 'b' is not a list to begin with
b := .(list(NULL)) # overwrite 'b' with an 'empty' list column
This is a tricky area because I think there's some inherent ambiguity here with list columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm currently working on some tests to run against this PR and previous behavior to see if there are any unexpected changes, will update here.
e.g. for this test, IINM there's already a recommended way to add a new list column as a plonk (full-column replacement): b := .(list(NULL)). Is it not possible (or just strongly ill-advised?) to allow both codepaths to continue to work:
I believe this only works if b is a list column, in the test mentioned above this would throw an error:
'list' object cannot be coerced to type 'double'
However this test passes on the PR:
DT = data.table(b = list(1:3))
test(2264.9, copy(DT)[, b := list(NULL)], copy(DT)[, b := .(list(NULL))])
So I think both paths work on the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some tests, except I just realized that the changes don't allow us to add any new columns with list(NULL)
in standard form quite yet. I believe I only addressed the replacement of columns with list(NULL)
. Is adding a new null list column in standard form something we want to be able to do as well? Just want to note that functional form works and this is consistent with the documentation update as well
Everything else looks consistent at least
dt = data.table(a = 1:3)
DT[, b := list(NULL)] # warning, doesn't add the new column
DT[, `:=`(b = list(NULL))] # works
# same as
data.table(a = 1:3, b = list(NULL))
# can be done with
DT[, b := .(list(NULL))]
# or
DT[, b := list(list(NULL))]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is adding a new null list column in standard form something we want to be able to do as well?
I'm not sure it's possible to do unambiguously. What's most important is that users can do what they need to, and there is a way to add "empty" list columns as you noted: b := .(list(NULL))
. We'd only worry about continuing to support b := list(NULL)
if there was a back-compatibility issue, which there's not in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM. How do you suppose we proceed with this issue? Are the revdep issues worth making for the sake of consistency? Primary goal for me is to close issues, so I'm open to suggestions 😸
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you suppose we proceed with this issue?
Let's start with the proposed list of the various ways to add/remove columns for list/non-list types. It can also serve as a piece of documentation to add somewhere (maybe FAQ).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to see, will leave as a separate comment
i posted the issue, and i'm totally fine with not changing any features, and instead updating the docs, as long as they explain why this inconsistency exists (maybe there is some reason that I do not understand?) If there is no strong reason to keep existing functionality (other than revdeps), would be nice to increase consistency (reduce user surprise) |
Here's a table of current behavior and proposed changes, @MichaelChirico LMK if there's anything else you'd like to see. For most of these I added some tests (although I trust most of it has been thoroughly tested by previous tests, ie adding, removing, etc.). TLDR:
|
wow that is a really great comparison table |
Agreed, that's pretty comprehensive. @joshhwuu good work! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me, it is better to have only one way of doing something, if possible, in my opinion, and this PR moves toward that ideal.
let's wait to see what Michael says.
Agree it's a great table! I want to read it again carefully -- hope to find the time this week. So far, I agree it looks like improved behavior. |
Hmm.. It seems that there's been an oversight on my end. While revisiting the code/documentation change again, I realized that since we know that the functional form wraps > DT = data.table(L = list('A'), i = 1)
> DT[, `:=`(L = NULL)]
> DT
# L i
# <list> <num>
# 1: [NULL] 1 I think this can be fixed, but I'll need some time to think of a good solution, suggestions are welcome. I'll be reorganizing the unit tests to be more comprehensive and use all forms of assignment to thoroughly test. Thanks for everyone's patience! |
Organized and added some tests, changed list wrapping behavior of |
looks good to me, thanks for the extensive tests |
Closes #5558
Previous behavior
From @tdhock:
This was reported to be inconsistent with column replacement with more than one row, see:
Additionally, there was this inconsistency as well:
Changes
In assign.c, add a new check to see if passed in
values
islist(NULL)
. If so, replace the list column with a list of NULL(s) of the same length.This is the new behavior:
We no longer delete the column, instead replace the column rows with NULLs.
This PR also changes behavior when doing more than one row, to be more consistent with
data.frame
replacement:Of course, this works with the other assignment methods.
Had to change one old test,
test(2058.20)
to reflect the new behavior as well.