Consistent Replacement of List Column with NULL #6167

joshhwuu · 2024-06-04T00:59:36Z

Previous behavior

In base R data.frame we can replace an element of a list column with NULL via:

> DF1=data.frame(L=I(list("A")),i=1)
> DF1$L=list(NULL)
> DF1
     L i
1 NULL 1

However in data.table, doing that results in deleting the list column entirely:

> DT1=data.table(L=list("A"),i=1)
> DT1$L=list(NULL)
> DT1
       i
   <num>
1:     1

This was reported to be inconsistent with column replacement with more than one row, see:

# old replacement of multiple rows, correct but inconsistent
> DT2=data.table(L=list("B","C"),i=1)
> DT2$L <- list(NULL,NULL)
> DT2
        L     i
   <list> <num>
1:            1
2:            1

Additionally, there was this inconsistency as well:

I can do this:
library(data.table)
DT1=data.table(L=list("A"),i=1)
DT1[, `:=`(L = list(NULL))]
that works, but oddly not
DT1[, L := list(NULL)]
which should be identical per data.table documentation.

Changes

Request: can we make the above code do a replacement (like base R data.frame) instead of deleting the column?

In assign.c, add a new check for whether the passed-in value is list(NULL). If so, replace the list column with a list of NULLs of the same length.
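As a rough illustration of that check, here is a minimal R sketch. The actual change is in C (assign.c); the helper names `is_list_null` and `replacement_for` are hypothetical and exist only for this example.

```r
# Hypothetical R-level sketch of the check described above.
# The real logic lives in assign.c; these helper names are illustrative only.
is_list_null <- function(value) {
  # TRUE only for a length-1 list whose single element is NULL, i.e. list(NULL)
  is.list(value) && length(value) == 1L && is.null(value[[1L]])
}

replacement_for <- function(column, value) {
  if (is.list(column) && is_list_null(value)) {
    # Build a list of NULLs with one entry per row of the existing column
    vector("list", length(column))
  } else {
    value
  }
}

# A two-row list column assigned list(NULL) becomes a two-element list of NULLs
replacement_for(list("B", "C"), list(NULL))
```

The point of the sketch is only that list(NULL) is treated as "fill every row with NULL" rather than "delete the column".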

This is the new behavior:

DT = data.table(L = list("A"), i = 1L)
DT$L = list(NULL)
#       L i
# 1: NULL 1

# SAME
DT[, L := list(NULL)]
#       L i
# 1: NULL 1

# SAME
DT[, `:=`(L = list(NULL))]
#       L i
# 1: NULL 1

We no longer delete the column; instead, each row of the column is replaced with NULL.

This PR also changes the behavior when the column has more than one row, to be more consistent with data.frame replacement:

# data.frame replacement:
DF = data.frame(L = I(list("B", "C")), i = 1L)
DF$L = list(NULL)
#      L i
# 1 NULL 1
# 2 NULL 1

# old
DT = data.table(L = list("B", "C"), i = 1L)
DT$L = list(NULL)
#        i
# 1:     1
# 2:     1

# new
DT = data.table(L = list("B", "C"), i = 1L)
DT$L = list(NULL)
#       L i
# 1: NULL 1
# 2: NULL 1

Of course, this works with the other assignment methods.

Had to change one old test, test(2058.20) to reflect the new behavior as well.

github-actions · 2024-06-04T01:17:11Z

Generated via commit ac8ce38

Time taken to finish the standard R installation steps: 12 minutes and 20 seconds

Time taken to run atime::atime_pkg on the tests: 3 minutes and 25 seconds

tdhock

these changes and tests look good, can you please add a NEWS item.

Anirban166

Thanks for adding the NEWS entry and the changes look good to me as well!

As for the tests, they are good too and work (just tested), but I think it might be neat to comment or separate them out a bit so one can quickly see what each test is doing and how it differs from the others. For example, for your tests from top to bottom, the comments could convey that you replaced a list column with standard assignment to NULL, did the same using the := syntax (modifying in place), compared with another data.table, replaced multiple elements with NULL, and then followed up with tests similar to the single-element replacement case.

ben-schwen · 2024-06-04T21:12:56Z

I know I'm quite late to the party, but in my opinion, ideally, we would bring the assign.c parts of set and := closer together. Ultimately, this would let us scrub the newcolnames argument from SEXP assign().

joshhwuu · 2024-06-04T22:17:43Z

Hm.. not sure I understand what you mean here, I thought the simplest fix would be in assign.c's internal behavior. Could you elaborate on what you mean by bringing the assign.c parts of set and := closer together?

tdhock · 2024-06-05T00:59:35Z

this may be a breaking change (revdep checks could fail as a result)
but probably good / worth making the change for consistency.
I wonder if the documentation needs updating? @avimallu wrote "should be identical as per data.table documentation" -- what part of the documentation was that, and should it be clarified to reflect this change?

ben-schwen · 2024-06-05T14:57:41Z

Hm.. not sure I understand what you mean here, I thought the simplest fix would be in assign.c's internal behavior. Could you elaborate on what you mean by bringing the assign.c parts of set and := closer together?

Currently, there are multiple ways to alter/add new columns to a data.table, e.g. via := or set.
However, set and := both call SEXP assign but use different parts of the underlying C code. I think the goal should be for both to use the same code, reducing complexity and the amount of code we have to maintain.
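For readers less familiar with the two entry points, a small sketch of the user-facing equivalence being discussed: the same column added once with := and once with set(). The table and column names are made up for illustration.

```r
library(data.table)

# Two identical starting tables (names are illustrative)
DT_walrus <- data.table(a = 1:3)
DT_set    <- copy(DT_walrus)

# Add column b by reference using :=
DT_walrus[, b := a * 2L]

# Add the same column by reference using set()
set(DT_set, j = "b", value = DT_set$a * 2L)

# Both routes should leave the tables with the same contents
all.equal(DT_walrus, DT_set)
```

Both calls end up in the C-level assign routine, which is why diverging code paths there can surface as user-visible inconsistencies.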

joshhwuu · 2024-06-05T18:54:50Z

Oh I see. Do you propose that we try to include the changes in this PR, or is it worth filing a separate issue?

ben-schwen · 2024-06-05T19:16:52Z

There are already multiple issues about the divergence of set and :=. It does not have to be this PR; I just thought this might be an interesting topic to work on in GSoC (maybe as a stacked PR).

joshhwuu · 2024-06-06T19:57:16Z

I wonder if the documentation needs updating? @avimallu wrote "should be identical as per data.table documentation" -- what part of the documentation was that, and should it be clarified to reflect this change?

@tdhock
While I can't say for sure which part of the documentation @avimallu was referring to, here are a few parts of the data.table documentation talking about the two forms:

Reference Semantics Vignette

b) The := operator

It can be used in j in two ways:

(a) The LHS := RHS form
DT[, c("colA", "colB", ...) := list(valA, valB, ...)]

# when you have only one column to assign to you
# can drop the quotes and list(), for convenience
DT[, colA := valA]

(b) The functional form
DT[, `:=`(colA = valA, # valA is assigned to colA
          colB = valB, # valB is assigned to colB
          ...
)]

In (a), LHS takes a character vector of column names and RHS a list of values. RHS just needs to be a list, irrespective of how it's generated (e.g., using lapply(), list(), mget(), mapply() etc.). This form is usually easy to program with and is particularly useful when you don't know the columns to assign values to in advance.

On the other hand, (b) is handy if you would like to jot some comments down for later.

This vignette notes that the two forms exist, with the primary difference being syntax and the functional form being more chatty. IMO, it implies(?) that the two are the same.

Assignment by reference doc

# 1. LHS := RHS form
DT[i, LHS := RHS, by = ...]
DT[i, c("LHS1", "LHS2") := list(RHS1, RHS2), by = ...]

# 2a. Functional form with `:=`
DT[i, `:=`(LHS1 = RHS1,
           LHS2 = RHS2,
           ...), by = ...]

# 2b. Functional form with let
DT[i, let(LHS1 = RHS1,
           LHS2 = RHS2,
           ...), by = ...]

I think this documentation also implies that the different usages both work. It does state that let and functional form are equivalent. So, I will add some tests in this PR to check that using let works the same as well, for thoroughness :)

Although these two pieces of documentation imply that the results of either form are largely the same, I haven't found anywhere in the documentation that says they are always guaranteed to be the same. While searching for this on Google, I found this Stack Overflow thread about different results when using the functional form and assigning by reference: https://stackoverflow.com/questions/44067091/different-results-for-standard-form-and-functional-form-of-data-table-assigne-by

Jan explained here that there are slight differences in how RHS is handled causing a difference in output between the two forms depending on whether the data we are assigning is a vector or a list:

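dt <- data.table(a = c('a','b','c'))
v <- c('A','B','C')  # assumed: v was not defined in the snippet; inferred from the output below
l <- list(v)

print(copy(dt)[, new := l])        # standard form: l is taken as the list of column values
print(copy(dt)[, `:=` (new = l)])  # functional form: l itself becomes the value, giving a list column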
        a    new
   <char> <char>
1:      a      A
2:      b      B
3:      c      C
        a    new
   <char> <list>
1:      a  A,B,C
2:      b  A,B,C
3:      c  A,B,C

This is still true as of current master (just tested), so I believe we shouldn't explicitly state that the results will be the exact same. But we should note that in most cases, the two forms are the same, which I believe the current documentation implies.

avimallu · 2024-06-06T21:57:21Z

This vignette notes that the two forms exist, with the primary difference being syntax and the functional form being more chatty. IMO, it implies(?) that the two are the same.

This was my interpretation when I commented on the issue!

tdhock · 2024-06-07T12:35:52Z

would be good to clarify the docs, explicitly write that they should be the same, and when they are expected to be different

joshhwuu · 2024-06-07T21:16:52Z

TBH @tdhock I'm still a little confused on the exact differences between standard and functional form of assigning by reference. I want to ask for some of Jan's (and others) input to help me understand it. Plus it'll keep the logs on this PR a little clearer, as this PR didn't intend to fix documentation but is only slightly related, WDYT about filing a separate issue for that?

Otherwise, if you think the vignette update is clear enough then we could keep it in this PR, however I'm having trouble reasoning why exactly the above behavior happens. My line of thinking at the moment is that because := is like an alias to list in functional form, then when we do:

dt[, `:=`(new = list(1:3))]

it is essentially equivalent to:

dt[, new := list(new = list(1:3))]

Since this is true (just tried), I wonder how the wrapping of RHS by list in standard form vs not wrapping in functional form is relevant

vignettes/datatable-reference-semantics.Rmd

tdhock

this is pretty close to what I had in mind, thanks!

vignettes/datatable-reference-semantics.Rmd

tdhock

list of lists is too specific, just say list

MichaelChirico · 2024-06-09T19:10:00Z

inst/tests/tests.Rraw

@@ -15369,7 +15369,7 @@ L = list(1:3, NULL, 4:6)
 test(2058.18, length(L), 3L)
 test(2058.19, as.data.table(L), data.table(V1=1:3, V2=4:6))  # V2 not V3        # no
 DT = data.table(a=1:3, b=c(4,5,6))
-test(2058.20, DT[,b:=list(NULL)], data.table(a=1:3))                            # no
+test(2058.20, DT[,b:=list(NULL)], data.table(a=1:3, b=list(NULL)))              # no


Hmm, this is a bit surprising to me. I suspect this will cause revdep breakage. Here are some examples:

https://github.com/search?q=lang%3AR%20%2F%3A%3D%5Cs*list%5C(%5Cs*NULL%5Cs*%5C)%2F&type=code

It would help to do a before/after on this PR of various ways to add columns, e.g. combinations of

adding 1 column vs. multiple columns

deleting 1 column vs. multiple columns

adding & deleting columns in the same query

operations on 1-row vs multi-row tables

operations on list- vs non-list columns

Your examples in the main PR body are good but only cover a small part of the above. It may be that we need to cause some breakage for consistency, but we need to understand completely what's changing, what the recommended alternative is, why we can't fix the issue back-compatibly, etc.
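One way to keep track of the combinations listed above is to enumerate them as a grid before writing the before/after cases. This is only a bookkeeping sketch in base R; the labels are illustrative and not part of data.table.

```r
# Hypothetical checklist of scenarios to compare before and after the PR.
# The labels below are illustrative names for the dimensions listed above.
scenarios <- expand.grid(
  operation = c("add one column", "add multiple columns",
                "delete one column", "delete multiple columns",
                "add and delete in the same query"),
  n_rows    = c("1-row table", "multi-row table"),
  col_type  = c("list column", "non-list column"),
  stringsAsFactors = FALSE
)

# Number of combinations to cover: 5 operations x 2 table sizes x 2 column types
nrow(scenarios)
```

Each row of `scenarios` would then correspond to one before/after comparison.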

e.g. for this test, IINM there's already a recommended way to add a new list column as a plonk (full-column replacement): b := .(list(NULL)). Is it not possible (or just strongly ill-advised?) to allow both codepaths to continue to work:

b := list(NULL)     # same as b := NULL, column deletion b/c 'b' is not a list to begin with
b := .(list(NULL))  # overwrite 'b' with an 'empty' list column

This is a tricky area because I think there's some inherent ambiguity here with list columns.

I'm currently working on some tests to run against this PR and previous behavior to see if there are any unexpected changes, will update here.

e.g. for this test, IINM there's already a recommended way to add a new list column as a plonk (full-column replacement): b := .(list(NULL)). Is it not possible (or just strongly ill-advised?) to allow both codepaths to continue to work:

I believe this only works if b is a list column, in the test mentioned above this would throw an error:
'list' object cannot be coerced to type 'double'

However this test passes on the PR:

DT = data.table(b = list(1:3))
test(2264.9, copy(DT)[, b := list(NULL)], copy(DT)[, b := .(list(NULL))])

So I think both paths work on the PR

Added some tests, except I just realized that the changes don't allow us to add any new columns with list(NULL) in standard form quite yet. I believe I only addressed the replacement of columns with list(NULL). Is adding a new null list column in standard form something we want to be able to do as well? Just want to note that functional form works and this is consistent with the documentation update as well

Everything else looks consistent at least

DT = data.table(a = 1:3)
DT[, b := list(NULL)]        # warning, doesn't add the new column
DT[, `:=`(b = list(NULL))]   # works
# same as data.table(a = 1:3, b = list(NULL))
# can be done with DT[, b := .(list(NULL))]
# or DT[, b := list(list(NULL))]

Is adding a new null list column in standard form something we want to be able to do as well?

I'm not sure it's possible to do unambiguously. What's most important is that users can do what they need to, and there is a way to add "empty" list columns as you noted: b := .(list(NULL)). We'd only worry about continuing to support b := list(NULL) if there was a back-compatibility issue, which there's not in this case.
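A short sketch of the recommended route mentioned above, adding an "empty" list column with b := .(list(NULL)). The table contents are made up for illustration, and the checks only confirm the shape of the resulting column.

```r
library(data.table)

# Illustrative three-row table without a list column
DT <- data.table(a = 1:3)

# Add a new list column whose rows are all NULL, using the plonk form
DT[, b := .(list(NULL))]

# b should be a list column with one NULL entry per row
is.list(DT$b)
length(DT$b)
all(vapply(DT$b, is.null, logical(1L)))
```

The extra level of list() removes the ambiguity: the outer list is the set of columns being assigned and the inner list(NULL) is the value for column b.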

SGTM. How do you suppose we proceed with this issue? Are the revdep issues worth making for the sake of consistency? Primary goal for me is to close issues, so I'm open to suggestions 😸

How do you suppose we proceed with this issue?

Let's start with the proposed list of the various ways to add/remove columns for list/non-list types. It can also serve as a piece of documentation to add somewhere (maybe FAQ).

Hard to see, will leave as a separate comment

tdhock · 2024-06-12T02:39:01Z

i posted the issue, and i'm totally fine with not changing any features, and instead updating the docs, as long as they explain why this inconsistency exists (maybe there is some reason that I do not understand?) If there is no strong reason to keep existing functionality (other than revdeps), would be nice to increase consistency (reduce user surprise)

joshhwuu · 2024-06-12T20:07:19Z

Here's a table of current behavior and proposed changes, @MichaelChirico LMK if there's anything else you'd like to see. For most of these I added some tests (although I trust most of it has been thoroughly tested by previous tests, ie adding, removing, etc.).

TLDR:

Replacement of a single-row list column with list(NULL) replaces the column with an empty list, instead of deleting the column (new, consistent with data.frame).
Replacement of a multi-row list column is the same, except RHS can now be just list(NULL); we no longer have to write list(NULL, NULL, ...) (new, consistent with data.frame).
Adding a new list column isn't changed at all.
To remove a list column, we now can only assign the list column to NULL and no longer have the option to use list(NULL).
Replacement/addition/removal of non-list types with list(NULL) doesn't work as expected (throws an error, same as now).

Type: Replacement of a list column in a single-row data.table with `list(NULL)` with $ or standard form with :=.

Old:
    # Deletes list column
    DT = data.table(L = list('A'), i = 1)
    DT$L = list(NULL)
    # or
    DT[, L := list(NULL)]
    #        i
    #    <num>
    # 1:     1

Proposed changes:
    # Replaces list column with empty list, consistent with data.frame
    # and consistent with current functional form
    DT = data.table(L = list('A'), i = 1)
    DT$L = list(NULL)
    # or
    DT[, L := list(NULL)]
    # or
    DT[, `:=`(L = list(NULL))] # before AND after PR the same
    #        i      L
    #    <num> <list>
    # 1:     1 [NULL]

Type: Replacement of a list column in a multi-row data.table with `list(NULL)` with $ or with :=.

Old:
    # Replaces list column with empty list
    DT = data.table(L = list('A', 'B'), i = 1)
    DT$L = list(NULL, NULL)
    # or
    DT[, L := list(NULL, NULL)]
    # or
    DT[, `:=`(L = list(NULL))]
    #        i      L
    #    <num> <list>
    # 1:     1 [NULL]
    # 2:     1 [NULL]

Proposed changes:
    # Does the same thing, but removes the need to specify "NULL" for every row
    DT = data.table(L = list('A', 'B'), i = 1)
    DT$L = list(NULL)
    # or
    DT[, L := list(NULL)]
    # or
    DT[, `:=`(L = list(NULL))] # Again, same before and after
    #        i      L
    #    <num> <list>
    # 1:     1 [NULL]
    # 2:     1 [NULL]

Type: Adding an empty list column to a single-row data.table.

Old:
    # Use a plonk when necessary
    DT = data.table(L = list('A'), i = 1)
    DT$D = list(list(NULL))
    # or
    DT[, D := .(list(NULL))]
    # or
    DT[, `:=`(D = list(NULL))]
    #         L     i      D
    #    <list> <num> <list>
    # 1:      A     1 [NULL]

Proposed changes:
    No changes here!

Type: Adding an empty list column to a multi-row data.table.

Old:
    # Use a plonk when necessary
    DT = data.table(L = list('A', 'B'), i = 1)
    DT$D = list(list(NULL))
    # or
    DT[, D := .(list(NULL))]
    # or
    DT[, `:=`(D = list(NULL))]
    #         L     i      D
    #    <list> <num> <list>
    # 1:      A     1 [NULL]
    # 2:      B     1 [NULL]

Proposed changes:
    Again, no changes here!

Type: Removal of a list column in a single-row data.table.

Old:
    # Removes the list column with either NULL
    # or list(NULL) (unless functional form)
    DT = data.table(L = list('A'), i = 1)
    DT$L = NULL # or list(NULL)
    # or
    DT[, L := NULL] # or list(NULL)
    # or
    DT[, `:=`(L = NULL)]
    #        i
    #    <num>
    # 1:     1

Proposed changes:
    # Removes the list column with NULL
    # setting to list(NULL) replaces column with empty list
    DT = data.table(L = list('A'), i = 1)
    DT$L = NULL
    # or
    DT[, L := NULL]
    # or
    DT[, `:=`(L = NULL)]
    #        i
    #    <num>
    # 1:     1

tdhock · 2024-06-13T15:11:14Z

wow that is a really great comparison table

Anirban166 · 2024-06-13T20:25:28Z

Agreed, that's pretty comprehensive. @joshhwuu good work!

tdhock

looks good to me, it is better to have only one way of doing something, if possible, in my opinion, and this PR moves toward that ideal.
let's wait to see what Michael says.

MichaelChirico · 2024-06-17T05:47:45Z

Agree it's a great table! I want to read it again carefully -- hope to find the time this week. So far, I agree it looks like improved behavior.

joshhwuu · 2024-06-19T01:29:20Z

Hmm.. It seems that there's been an oversight on my end. While revisiting the code/documentation change again, I realized that since we know that the functional form wraps RHS in a list, SEXP assign will interpret this as a null replacement instead of deletion. I tested and it seems that I was right:

> DT = data.table(L = list('A'), i = 1)
> DT[, `:=`(L = NULL)]
> DT
#         L     i
#    <list> <num>
# 1: [NULL]     1

I think this can be fixed, but I'll need some time to think of a good solution, suggestions are welcome. I'll be reorganizing the unit tests to be more comprehensive and use all forms of assignment to thoroughly test. Thanks for everyone's patience!

joshhwuu · 2024-06-20T00:19:52Z

Organized and added some tests. Changed the list-wrapping behavior of RHS in functional form so that a single NULL is not wrapped, which allows us to remove columns by assigning NULL in functional form, as listed in the table above.
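To make the wrapping change concrete, here is a minimal R sketch of the idea. It is not the code in this PR; `functional_rhs` is a hypothetical stand-in for however the functional form collects its arguments.

```r
# Hypothetical sketch: how the functional form's arguments could be collected.
# functional_rhs is an illustrative name, not a data.table function.
functional_rhs <- function(...) {
  args <- list(...)
  # A single NULL argument is passed through unwrapped, so it still means
  # "delete the column" rather than "replace with list(NULL)"
  if (length(args) == 1L && is.null(args[[1L]])) {
    return(NULL)
  }
  # Everything else stays wrapped in a list, as before
  args
}

functional_rhs(L = NULL)        # NULL: column deletion
functional_rhs(L = list(NULL))  # list(L = list(NULL)): NULL replacement
```

Under this sketch, `:=`(L = NULL) and L := NULL would reach the assignment code with the same RHS, while list(NULL) keeps its new replacement meaning.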

tdhock · 2024-06-20T00:46:19Z

looks good to me, thanks for the extensive tests

joshhwuu added 2 commits June 3, 2024 17:14

Changed replacement behavior of list columns

a1a2a4d

use allocNAVector() instead

0a5e1b7

joshhwuu requested review from tdhock and Anirban166 June 4, 2024 00:59

joshhwuu requested review from HughParsonage and MichaelChirico as code owners June 4, 2024 00:59

joshhwuu mentioned this pull request Jun 4, 2024

Master List of data.table Issues for GSoC '24 (Josh) joshhwuu/gsoc-2024#1

HughParsonage approved these changes Jun 4, 2024

tdhock requested changes Jun 4, 2024

add news entry

7fc9d31

Anirban166 approved these changes Jun 4, 2024

comments on tests

4469520

let tests for thoroughness

e0c050c

joshhwuu added 4 commits June 7, 2024 12:31

updated vignette

b04aad9

better

e1a8d78

more

37c4ccc

typo

c69bf06

joshhwuu added 2 commits June 7, 2024 14:33

new line of thinking

810c62f

typo

3583912

joshhwuu added 5 commits June 7, 2024 23:24

added example to vignette

42a61ca

Merge branch 'master' into consistentcolreplacement

535eed1

assign docs

72c9175

Merge branch 'consistentcolreplacement' of https://github.com/Rdatata…

cbd60e0

…ble/data.table into consistentcolreplacement

slight changes

599f0e2

tdhock reviewed Jun 8, 2024
