Consistent Replacement of List Column with NULL #6167

Open
wants to merge 23 commits into
base: master

joshhwuu
Member

@joshhwuu joshhwuu commented Jun 4, 2024

Closes #5558

Previous behavior

From @tdhock:

In base R data.frame we can replace an element of a list column with NULL via:

> DF1=data.frame(L=I(list("A")),i=1)
> DF1$L=list(NULL)
> DF1
     L i
1 NULL 1

However in data.table, doing that results in deleting the list column entirely:

> DT1=data.table(L=list("A"),i=1)
> DT1$L=list(NULL)
> DT1
       i
   <num>
1:     1

This was reported to be inconsistent with column replacement with more than one row, see:

# old replacement of multiple rows, correct but inconsistent
> DT2=data.table(L=list("B","C"),i=1)
> DT2$L <- list(NULL,NULL)
> DT2
        L     i
   <list> <num>
1:            1
2:            1

Additionally, there was this inconsistency as well:

I can do this:

library(data.table)
DT1=data.table(L=list("A"),i=1)
DT1[, `:=`(L = list(NULL))]

that works, but oddly not

DT1[, L := list(NULL)]

which should be identical per data.table documentation.

Changes

Request: can we make the above code do a replacement (like base R data.frame) instead of deleting the column?

In assign.c, add a check for whether the passed-in value is list(NULL). If so, replace the list column with a list of NULLs of the same length.
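As a rough illustration only (the real check is written in C inside assign.c and is not shown in this thread), the idea can be modelled at the R level. The helper names is_list_null and null_column are hypothetical and exist only for this sketch:

```r
# Hypothetical R-level model of the check described above.
# The actual implementation is C code in assign.c.

# TRUE only for a length-1 list whose single element is NULL, i.e. list(NULL)
is_list_null <- function(value) {
  is.list(value) && length(value) == 1L && is.null(value[[1L]])
}

# A list of `nrow` NULLs, standing in for the replacement column
null_column <- function(nrow) {
  vector("list", nrow)
}

is_list_null(list(NULL))   # the case the new check targets
is_list_null(NULL)         # plain NULL is a different case (column deletion)
length(null_column(3L))
```

Keeping plain NULL distinct from list(NULL) is what lets deletion and replacement remain separate operations.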

This is the new behavior:

DT = data.table(L = list("A"), i = 1L)
DT$L = list(NULL)
#       L i
# 1: NULL 1

# SAME
DT[, L := list(NULL)]
#       L i
# 1: NULL 1

# SAME
DT[, `:=`(L = list(NULL))]
#       L i
# 1: NULL 1

The column is no longer deleted; instead, each of its rows is replaced with NULL.

This PR also changes the behavior for tables with more than one row, to be more consistent with data.frame replacement:

# data.frame replacement:
DF = data.frame(L = I(list("B", "C")), i = 1L)
DF$L = list(NULL)
#      L i
# 1 NULL 1
# 2 NULL 1

# old
DT = data.table(L = list("B", "C"), i = 1L)
DT$L = list(NULL)
#        i
# 1:     1
# 2:     1

# new
DT = data.table(L = list("B", "C"), i = 1L)
DT$L = list(NULL)
#       L i
# 1: NULL 1
# 2: NULL 1
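The data.frame side of this comparison can be checked with base R alone. The following is a small sketch (variable names DF1 and DF2 are chosen here for illustration) confirming that the column survives and is filled with NULLs in both the single-row and the multi-row case:

```r
# Base R only: replacing a list column with list(NULL) keeps the column.
DF1 <- data.frame(L = I(list("A")), i = 1L)
DF1$L <- list(NULL)

DF2 <- data.frame(L = I(list("B", "C")), i = 1L)
DF2$L <- list(NULL)   # the length-1 list is recycled to both rows

names(DF1)                            # column L is still present
vapply(DF2$L, is.null, logical(1))    # every row of L is NULL
```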

Of course, this works with the other assignment methods.

One existing test, test(2058.20), had to be updated to reflect the new behavior as well.

github-actions bot commented Jun 4, 2024

Comparison Plot

Generated via commit ac8ce38

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 20 seconds

Time taken to run atime::atime_pkg on the tests: 3 minutes and 25 seconds

Member

@tdhock tdhock left a comment

these changes and tests look good, can you please add a NEWS item.

Member

@Anirban166 Anirban166 left a comment

Thanks for adding the NEWS entry and the changes look good to me as well!

The tests are good too and pass (just tested), but it might be neat to comment them or separate them out a bit so that a reader can quickly see what each test does and how it differs from the others. For example, going through your tests from top to bottom, comments could note that you replaced a list column with NULL using standard assignment, did the same using the := syntax (modifying in place), compared the result with another data.table, replaced multiple elements with NULL, and then followed up with tests similar to the single-element replacement case.

@ben-schwen
Member

I know I'm quite late to the party, but in my opinion, ideally, we would bring the assign.c parts of set and := closer together. Ultimately, this would let us scrub the newcolnames argument from SEXP assign().

@joshhwuu
Member Author

joshhwuu commented Jun 4, 2024

Hm.. not sure I understand what you mean here, I thought the simplest fix would be in assign.c's internal behavior. Could you elaborate on what you mean by bringing the assign.c parts of set and := closer together?

@tdhock
Member

tdhock commented Jun 5, 2024

this may be a breaking change (revdep checks could fail as a result)
but probably good / worth making the change for consistency.
I wonder if the documentation needs updating? @avimallu wrote "should be identical as per data.table documentation" -- what part of the documentation was that, and should it be clarified to reflect this change?

@ben-schwen
Member

Hm.. not sure I understand what you mean here, I thought the simplest fix would be in assign.c's internal behavior. Could you elaborate on what you mean by bringing the assign.c parts of set and := closer together?

Currently, there are multiple ways to alter/add new columns to a data.table, e.g. via := or set.
However, set and := both call SEXP assign but use different parts in the underlying C code. I think the goal would be that both use the same code reducing the complexity and the code to maintain in our code base.
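For readers less familiar with the two entry points, here is a minimal sketch of the same column addition done through := and through set(). It only illustrates the user-facing equivalence for a simple case and says nothing about how the C code paths differ; the table and column names are made up for the example:

```r
library(data.table)

DT1 <- data.table(a = 1:3)
DT2 <- copy(DT1)

# Add column b by reference via := inside [.data.table
DT1[, b := a * 2L]

# Add the same column by reference via set()
set(DT2, j = "b", value = DT2$a * 2L)

all.equal(DT1, DT2)
```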

@joshhwuu
Member Author

joshhwuu commented Jun 5, 2024

Oh I see. Do you propose that we try to include the changes in this PR, or is it worth filing a separate issue?

@ben-schwen
Member

There are already multiple issues about the divergence of set and :=. It does not have to be this PR, I just thought that this might be an interesting topic to work on in GSOC (maybe as stacked PR)

@joshhwuu
Member Author

joshhwuu commented Jun 6, 2024

I wonder if the documentation needs updating? @avimallu wrote "should be identical as per data.table documentation" -- what part of the documentation was that, and should it be clarified to reflect this change?

@tdhock
While I can't say for sure which part of the documentation @avimallu was referring to, here are a few parts of the data.table documentation talking about the two forms:

Reference Semantics Vignette

b) The := operator

It can be used in j in two ways:

(a) The LHS := RHS form

DT[, c("colA", "colB", ...) := list(valA, valB, ...)]

# when you have only one column to assign to you
# can drop the quotes and list(), for convenience
DT[, colA := valA]

(b) The functional form

DT[, `:=`(colA = valA, # valA is assigned to colA
          colB = valB, # valB is assigned to colB
          ...
)]
  • In (a), LHS takes a character vector of column names and RHS a list of values. RHS just needs to be a list, irrespective of how its generated (e.g., using lapply(), list(), mget(), mapply() etc.). This form is usually easy to program with and is particularly useful when you don't know the columns to assign values to in advance.

  • On the other hand, (b) is handy if you would like to jot some comments down for later.

This vignette presents the two forms as differing mainly in syntax, with the functional form being more chatty. IMO, it implies(?) that the two give the same result.

Assignment by reference doc

# 1. LHS := RHS form
DT[i, LHS := RHS, by = ...]
DT[i, c("LHS1", "LHS2") := list(RHS1, RHS2), by = ...]

# 2a. Functional form with `:=`
DT[i, `:=`(LHS1 = RHS1,
           LHS2 = RHS2,
           ...), by = ...]

# 2b. Functional form with let
DT[i, let(LHS1 = RHS1,
           LHS2 = RHS2,
           ...), by = ...]

I think this documentation also implies that the different usages both work. It does state that let and functional form are equivalent. So, I will add some tests in this PR to check that using let works the same as well, for thoroughness :)
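A quick sketch of the equivalence between the functional form and let for ordinary (non-NULL) values, assuming a data.table version that provides let(). The table names DT_a and DT_b are illustrative:

```r
library(data.table)

DT_a <- data.table(a = 1:3)
DT_b <- copy(DT_a)

# 2a. Functional form with `:=`
DT_a[, `:=`(b = a + 1L, c = a * 2L)]

# 2b. Functional form with let (assumes let() is available)
DT_b[, let(b = a + 1L, c = a * 2L)]

all.equal(DT_a, DT_b)
```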

Although both pieces of documentation imply that the results of either form are largely the same, I haven't found anywhere that guarantees they are always the same. While searching for this, I found this Stack Overflow thread about different results when using the functional form and assigning by reference: https://stackoverflow.com/questions/44067091/different-results-for-standard-form-and-functional-form-of-data-table-assigne-by

Jan explained here that there are slight differences in how RHS is handled causing a difference in output between the two forms depending on whether the data we are assigning is a vector or a list:

dt <- data.table(a = c('a','b','c'))
v <- c('A','B','C') # definition missing from the snippet; inferred from the output below
l <- list(v)

print(copy(dt)[, new := l])
print(copy(dt)[, `:=` (new = l)])
        a    new
   <char> <char>
1:      a      A
2:      b      B
3:      c      C
        a    new
   <char> <list>
1:      a  A,B,C
2:      b  A,B,C
3:      c  A,B,C

This is still true as of current master (just tested), so I believe we shouldn't explicitly state that the results will be the exact same. But we should note that in most cases, the two forms are the same, which I believe the current documentation implies.

@avimallu
Contributor

avimallu commented Jun 6, 2024

This vignette presents the two forms as differing mainly in syntax, with the functional form being more chatty. IMO, it implies(?) that the two give the same result.

This was my interpretation when I commented on the issue!

@tdhock
Member

tdhock commented Jun 7, 2024

would be good to clarify the docs, explicitly write that they should be the same, and when they are expected to be different

@joshhwuu
Member Author

joshhwuu commented Jun 7, 2024

TBH @tdhock I'm still a little confused about the exact differences between the standard and functional forms of assigning by reference, and I'd like to ask for input from Jan (and others) to help me understand them. This PR didn't set out to fix documentation, which is only loosely related, so a separate issue would also keep the logs here a little clearer. WDYT about filing one?

Otherwise, if you think the vignette update is clear enough then we could keep it in this PR, however I'm having trouble reasoning why exactly the above behavior happens. My line of thinking at the moment is that because := is like an alias to list in functional form, then when we do:

dt[, `:=`(new = list(1:3))]

it is essentially equivalent to:

dt[, new := list(new = list(1:3))]

Since this is true (just tried), I wonder how the wrapping of RHS by list in standard form vs not wrapping in functional form is relevant

Member

@tdhock tdhock left a comment

this is pretty close to what I had in mind, thanks!

vignettes/datatable-reference-semantics.Rmd (outdated, resolved)
Member

@tdhock tdhock left a comment

list of lists is too specific, just say list

man/assign.Rd (outdated, resolved)
@@ -15369,7 +15369,7 @@ L = list(1:3, NULL, 4:6)
 test(2058.18, length(L), 3L)
 test(2058.19, as.data.table(L), data.table(V1=1:3, V2=4:6)) # V2 not V3 # no
 DT = data.table(a=1:3, b=c(4,5,6))
-test(2058.20, DT[,b:=list(NULL)], data.table(a=1:3)) # no
+test(2058.20, DT[,b:=list(NULL)], data.table(a=1:3, b=list(NULL))) # no
Member

Hmm, this is a bit surprising to me. I suspect this will cause revdep breakage. Here are some examples:

https://github.com/search?q=lang%3AR%20%2F%3A%3D%5Cs*list%5C(%5Cs*NULL%5Cs*%5C)%2F&type=code

It would help to do a before/after on this PR of various ways to add columns, e.g. combinations of

  • adding 1 column vs. multiple columns
  • deleting 1 column vs. multiple columns
  • adding & deleting columns in the same query
  • operations on 1-row vs multi-row tables
  • operations on list- vs non-list columns

Your examples in the main PR body are good but only cover a small part of the above. It may be that we need to cause some breakage for consistency, but we need to understand completely what's changing, what the recommended alternative is, why we can't fix the issue back-compatibly, etc.

e.g. for this test, IINM there's already a recommended way to add a new list column as a plonk (full-column replacement): b := .(list(NULL)). Is it not possible (or just strongly ill-advised?) to allow both codepaths to continue to work:

b := list(NULL) # same as b := NULL, column deletion b/c 'b' is not a list to begin with
b := .(list(NULL)) # overwrite 'b' with an 'empty' list column

This is a tricky area because I think there's some inherent ambiguity here with list columns.

Member Author

@joshhwuu joshhwuu Jun 9, 2024

I'm currently working on some tests to run against this PR and previous behavior to see if there are any unexpected changes, will update here.

e.g. for this test, IINM there's already a recommended way to add a new list column as a plonk (full-column replacement): b := .(list(NULL)). Is it not possible (or just strongly ill-advised?) to allow both codepaths to continue to work:

I believe this only works if b is a list column, in the test mentioned above this would throw an error:
'list' object cannot be coerced to type 'double'

However this test passes on the PR:

DT = data.table(b = list(1:3))
test(2264.9, copy(DT)[, b := list(NULL)], copy(DT)[, b := .(list(NULL))])

So I think both paths work on the PR

Member Author

@joshhwuu joshhwuu Jun 9, 2024

Added some tests, except I just realized that the changes don't allow us to add any new columns with list(NULL) in standard form quite yet. I believe I only addressed the replacement of columns with list(NULL). Is adding a new null list column in standard form something we want to be able to do as well? Just want to note that functional form works and this is consistent with the documentation update as well

Everything else looks consistent at least

DT = data.table(a = 1:3)
DT[, b := list(NULL)] # warning, doesn't add the new column
DT[, `:=`(b = list(NULL))] # works

# same as
data.table(a = 1:3, b = list(NULL))
# can be done with
DT[, b := .(list(NULL))]
# or
DT[, b := list(list(NULL))]
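A small check of the plonk idiom mentioned above, as a sketch: it adds a new column b to a table that does not yet have one and then inspects the result rather than assuming any particular printed output.

```r
library(data.table)

DT <- data.table(a = 1:3)

# The RHS is a list holding a single column, and that column is list(NULL),
# which is recycled to the number of rows.
DT[, b := .(list(NULL))]

is.list(DT$b)                        # b is a list column
length(DT$b)                         # one element per row
vapply(DT$b, is.null, logical(1))    # each element is NULL
```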

Member

Is adding a new null list column in standard form something we want to be able to do as well?

I'm not sure it's possible to do unambiguously. What's most important is that users can do what they need to, and there is a way to add "empty" list columns as you noted: b := .(list(NULL)). We'd only worry about continuing to support b := list(NULL) if there was a back-compatibility issue, which there's not in this case.

Member Author

SGTM. How do you suppose we proceed with this issue? Are the revdep issues worth making for the sake of consistency? Primary goal for me is to close issues, so I'm open to suggestions 😸

Member

How do you suppose we proceed with this issue?

Let's start with the proposed list of the various ways to add/remove columns for list/non-list types. It can also serve as a piece of documentation to add somewhere (maybe FAQ).

Member Author

@joshhwuu joshhwuu Jun 12, 2024

Hard to see, will leave as a separate comment

@tdhock
Member

tdhock commented Jun 12, 2024

i posted the issue, and i'm totally fine with not changing any features, and instead updating the docs, as long as they explain why this inconsistency exists (maybe there is some reason that I do not understand?) If there is no strong reason to keep existing functionality (other than revdeps), would be nice to increase consistency (reduce user surprise)

@joshhwuu
Member Author

joshhwuu commented Jun 12, 2024

Here's a table of current behavior and proposed changes, @MichaelChirico LMK if there's anything else you'd like to see. For most of these I added some tests (although I trust most of it has been thoroughly tested by previous tests, ie adding, removing, etc.).

TLDR:

  1. Replacement of a single-row list column with list(NULL) replaces the column with an empty list, instead of deleting the column (new, consistent with data.frame).
  2. Replacement of a multi-row list column is the same, except that the RHS can now be just list(NULL), so we no longer have to write list(NULL, NULL, ...) (new, consistent with data.frame).
  3. Adding a new list column isn't changed at all.
  4. To remove a list column, we now can only assign the list column to NULL and no longer have the option to use list(NULL).
  5. Replacement/addition/removal of non-list types with list(NULL) doesn't work as expected (throws an error, same as now).
Comparison of old behavior and proposed changes, by type of operation:

Type: Replacement of a list column in a single-row data.table with list(NULL), using $ or the standard form of :=.

Old:
# Deletes list column
DT = data.table(L = list('A'), i = 1)
DT$L = list(NULL)
# or
DT[, L := list(NULL)]
#        i
#    <num>
# 1:     1

Proposed changes:
# Replaces list column with empty list, consistent with data.frame 
# and consistent with current functional form
DT = data.table(L = list('A'), i = 1)
DT$L = list(NULL)
# or
DT[, L := list(NULL)]
# or
DT[, `:=`(L = list(NULL))] # before AND after PR the same
#        i      L
#    <num> <list>
# 1:     1 [NULL]

Type: Replacement of a list column in a multi-row data.table with list(NULL), using $ or :=.

Old:
# Replaces list column with empty list
DT = data.table(L = list('A', 'B'), i = 1)
DT$L = list(NULL, NULL)
# or
DT[, L := list(NULL, NULL)]
# or
DT[, `:=`(L = list(NULL))]
#        i      L
#    <num> <list>
# 1:     1 [NULL]
# 2:     1 [NULL]

Proposed changes:
# Does the same thing, but removes the need to specify "NULL" for every row
DT = data.table(L = list('A', 'B'), i = 1)
DT$L = list(NULL)
# or
DT[, L := list(NULL)]
# or
DT[, `:=`(L = list(NULL))] # Again, same before and after
#        i      L
#    <num> <list>
# 1:     1 [NULL]
# 2:     1 [NULL]

Type: Adding an empty list column to a single-row data.table.

Old:
# Use a plonk when necessary
DT = data.table(L = list('A'), i = 1)
DT$D = list(list(NULL))
# or
DT[, D := .(list(NULL))]
# or
DT[, `:=`(D = list(NULL))]
#         L     i      D
#    <list> <num> <list>
# 1:      A     1 [NULL]

Proposed changes:
No changes here!

Type: Adding an empty list column to a multi-row data.table.

Old:
# Use a plonk when necessary
DT = data.table(L = list('A', 'B'), i = 1)
DT$D = list(list(NULL))
# or
DT[, D := .(list(NULL))]
# or
DT[, `:=`(D = list(NULL))]
#         L     i      D
#    <list> <num> <list>
# 1:      A     1 [NULL]
# 2:      B     1 [NULL]

Proposed changes:
Again, no changes here!

Type: Removal of a list column in a single-row data.table.

Old:
# Removes the list column with either NULL 
# or list(NULL) (unless functional form)
DT = data.table(L = list('A'), i = 1)
DT$L = NULL # or list(NULL)
# or
DT[, L := NULL] # or list(NULL)
# or
DT[, `:=`(L = NULL)]
#        i
#    <num>
# 1:     1

Proposed changes:
# Removes the list column with NULL
# setting to list(NULL) replaces column with empty list
DT = data.table(L = list('A'), i = 1)
DT$L = NULL
# or
DT[, L := NULL]
# or
DT[, `:=`(L = NULL)]
#        i
#    <num>
# 1:     1

@tdhock
Member

tdhock commented Jun 13, 2024

wow that is a really great comparison table

@Anirban166
Member

Agreed, that's pretty comprehensive. @joshhwuu good work!

Member

@tdhock tdhock left a comment

looks good to me, it is better to have only one way of doing something, if possible, in my opinion, and this PR moves toward that ideal.
let's wait to see what Michael says.

@MichaelChirico
Member

Agree it's a great table! I want to read it again carefully -- hope to find the time this week. So far, I agree it looks like improved behavior.

@joshhwuu
Member Author

Hmm.. It seems that there's been an oversight on my end. While revisiting the code/documentation change again, I realized that since we know that the functional form wraps RHS in a list, SEXP assign will interpret this as a null replacement instead of deletion. I tested and it seems that I was right:

> DT = data.table(L = list('A'), i = 1)
> DT[, `:=`(L = NULL)]
> DT
#         L     i
#    <list> <num>
# 1: [NULL]     1

I think this can be fixed, but I'll need some time to think of a good solution, suggestions are welcome. I'll be reorganizing the unit tests to be more comprehensive and use all forms of assignment to thoroughly test. Thanks for everyone's patience!

@joshhwuu
Member Author

joshhwuu commented Jun 20, 2024

Organized and added some tests. Also changed the list-wrapping behavior of the RHS in functional form: a single NULL is no longer wrapped, so columns can again be removed by assigning NULL in functional form, as listed in the table above.
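For reference, a minimal sketch of column removal by assigning NULL in both the standard and the functional form. This is the behavior the table lists under removal; the names DT1 and DT2 are illustrative, and the sketch inspects column names instead of assuming printed output:

```r
library(data.table)

DT1 <- data.table(L = list("A"), i = 1)
DT1[, L := NULL]          # standard form: delete column L

DT2 <- data.table(L = list("A"), i = 1)
DT2[, `:=`(L = NULL)]     # functional form: intended to delete L as well

names(DT1)
names(DT2)
```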

@tdhock
Member

tdhock commented Jun 20, 2024

looks good to me, thanks for the extensive tests

Development

Successfully merging this pull request may close these issues.

Inconsistent replacement of list column element with NULL in table with 1 row
7 participants