Clarify T/F versus TRUE/FALSE usage in j argument with new FAQ entry* #6196

Open
wants to merge 7 commits into
base: master
from

Conversation

Nj221102
Contributor

@Nj221102 Nj221102 commented Jun 20, 2024

closes #4846

Description

This pull request adds a new FAQ entry to vignettes/datatable-faq.Rmd that clarifies how T/F and TRUE/FALSE behave differently when used in the j argument of data.table. The entry explains why TRUE and FALSE are recommended over T and F to avoid unexpected behavior.

@Anirban166
Member

Does T/F have any utility besides being an abbreviated alias of TRUE/FALSE? (I'm wondering if they were introduced for some reason aside from the shorthand benefit)

@Nj221102
Contributor Author

Nj221102 commented Jun 20, 2024

Does T/F have any utility besides being an abbreviated alias of TRUE/FALSE? (I'm wondering if they were introduced for some reason aside from the shorthand benefit)

I don't know much about them other than that they are abbreviations, but here is some info I found online:

Historical Context: In early versions of S (the precursor to R), T and F were introduced as constants representing TRUE and FALSE, respectively. This convention has been carried over into R for consistency and familiarity with users who have experience with S.

Compatibility: Some packages or legacy code might still use T and F instead of TRUE and FALSE. Therefore, maintaining compatibility with these conventions ensures that older code continues to work seamlessly in newer R environments.

@@ -460,6 +460,10 @@ Happily, an internet search for "How does R method dispatch work" (at the time o

However, features like basic S3 dispatch (pasting the function name together with the class name) is why some R folk love R. It's so simple. No complicated registration or signature is required. There isn't much needed to learn. To create the `merge` method for data.table all that was required, literally, was to merely create a function called `merge.data.table`.

## Why do `T` and `F` behave differently from `TRUE` and `FALSE` in `data.table`?

In R, `T` and `F` are global variables that default to `TRUE` and `FALSE`, respectively. However, they can be redefined in the user environment, leading to unexpected behavior. The `data.table` package might treat `T` and `F` as variable names rather than logical constants. To avoid this issue, always use `TRUE` and `FALSE` for logical indexing and column selection in `data.table`.
Member

logical indexing -> logical subsetting?
also the user could re-define TRUE and FALSE, so I think saying that T and F default to TRUE and FALSE, and can be redefined is misleading. better to say anything can be redefined in the user environment? but this is not specific to data table so I'm not sure that any FAQ is even needed
@jangorecki can you please review since you commented on the original issue?
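
For reference on the redefinition point, a minimal base R sketch of how `T` and `TRUE` differ. It only illustrates base R behavior and makes no claim about data.table itself; the variable names `masked`, `restored` and `res` are made up for the illustration.

```r
# T is an ordinary variable defined in base, so a user-level binding can mask it.
T <- 0
masked <- isTRUE(T)    # FALSE: T now refers to the user binding 0
rm(T)                  # drop the mask; the base definition is visible again
restored <- isTRUE(T)  # TRUE

# TRUE is a reserved word: a direct assignment to it fails when evaluated.
res <- tryCatch({ TRUE <- 0; "assigned" }, error = function(e) "error")
res                    # "error"
```

So `T` and `F` can be silently masked by ordinary assignment, while a plain `TRUE <- value` raises an error.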

Member

looks ok to me

@Anirban166
Member

Historical Context: In early versions of S (the precursor to R), T and F were introduced as constants representing TRUE and FALSE, respectively. This convention has been carried over into R for consistency and familiarity with users who have experience with S.

Compatibility: Some packages or legacy code might still use T and F instead of TRUE and FALSE. Therefore, maintaining compatibility with these conventions ensures that older code continues to work seamlessly in newer R environments.

It looks like it is something not just for data.table then (since it applies to R or other packages in general), in which case it probably doesn't warrant an FAQ entry.

@Nj221102
Contributor Author

Nj221102 commented Jun 20, 2024

Historical Context: In early versions of S (the precursor to R), T and F were introduced as constants representing TRUE and FALSE, respectively. This convention has been carried over into R for consistency and familiarity with users who have experience with S.
Compatibility: Some packages or legacy code might still use T and F instead of TRUE and FALSE. Therefore, maintaining compatibility with these conventions ensures that older code continues to work seamlessly in newer R environments.

It looks like it is something not just for data.table then (since it applies to R or other packages in general), in which case it probably doesn't warrant an FAQ entry.

In the discussion of issue #4846, the decision was to add this as an FAQ entry. Hi @MichaelChirico, WDYT?

If this is not needed anymore, I can close this PR.

@@ -460,6 +460,10 @@ Happily, an internet search for "How does R method dispatch work" (at the time o

However, features like basic S3 dispatch (pasting the function name together with the class name) is why some R folk love R. It's so simple. No complicated registration or signature is required. There isn't much needed to learn. To create the `merge` method for data.table all that was required, literally, was to merely create a function called `merge.data.table`.

## Why do `T` and `F` behave differently from `TRUE` and `FALSE` in `data.table`?

In R, `T` and `F` are global variables that default to `TRUE` and `FALSE`, respectively. However, they can be redefined in the user environment, leading to unexpected behavior. The `data.table` package might treat `T` and `F` as variable names rather than logical constants. To avoid this issue, always use `TRUE` and `FALSE` for logical subsetting and column selection in `data.table`.
Member

I think it's too many words. I would simply reproduce the example from the cited issue, then quickly explain what's happening and give the recommendation to avoid T/F shorthand in general.

We might even give a shoutout to T_and_F_symbol_linter() 🙃

@MichaelChirico
Member

MichaelChirico commented Jun 20, 2024

It looks like it is something not just for data.table then (since it applies to R or other packages in general), in which case it probably doesn't warrant an FAQ entry.

@Anirban166 no, not quite -- read the linked issue. Consider mtcars[,1:3][, c(T,F,T)], which will "work as expected" for data.frames. The issue is data.table query analysis does not determine that c(T,F,T) is a logical query, IINM because typeof(substitute(T)) and typeof(substitute(TRUE)) are different and we'd have to evaluate T to see it's TRUE.

Moreover data.table scoping makes things complicated -- consider DT=data.table(T = 1, F = 2), what should DT[, c(T, F)] return? We'd rather not special-case these ill-advised symbols T/F.
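
The parse-level difference mentioned in this comment can be checked in base R alone. A minimal sketch: `arg_type` and `names_used` are hypothetical helpers written for this illustration, not data.table functions, and the sketch does not show how data.table actually analyzes `j`.

```r
# Hypothetical helper: type of an argument as written, without evaluating it.
arg_type <- function(j) typeof(substitute(j))

arg_type(TRUE)  # "logical": TRUE is a constant in the parsed code
arg_type(T)     # "symbol": T is a name that still has to be looked up

# Hypothetical helper: variable names mentioned in an unevaluated argument.
names_used <- function(j) all.vars(substitute(j))

names_used(c(TRUE, FALSE, TRUE))  # character(0)
names_used(c(T, F, T))            # "T" "F"
```

Without evaluating the symbols, `c(T, F, T)` is indistinguishable from any other call on two variables, which matches the point that `T` would have to be evaluated to learn that it is `TRUE`.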

@Anirban166
Member

@Anirban166 no, not quite -- read the linked issue. Consider mtcars[,1:3][, c(T,F,T)], which will "work as expected" for data.frames. The issue is data.table query analysis does not determine that c(T,F,T) is a logical query, IINM because typeof(substitute(T)) and typeof(substitute(TRUE)) are different and we'd have to evaluate T to see it's TRUE.

Moreover data.table scoping makes things complicated -- consider DT=data.table(T = 1, F = 2), what should DT[, c(T, F)] return? We'd rather not special-case these ill-advised symbols T/F.

I think it's reasonable that data.table doesn't automatically recognize T and F as representing TRUE and FALSE, both because of how scoping works and in general. Yes, they would be columns in DT and not logical values, but that's what one should generally expect imo. I don't think that using those single letter variable names just to represent a shorthand of boolean values is a good practice or something that others should expect from a package to follow up on if that makes sense.
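
The scoping concern can be approximated in base R. This is only an analogy under a stated assumption: a plain list stands in for the columns of `DT = data.table(T = 1, F = 2)`, and `eval()` with that list plays the role of evaluating `j` with columns in scope. It is not data.table's implementation.

```r
# Assumption: a plain list stands in for the columns T and F of the table.
cols <- list(T = 1, F = 2)

# With the columns in scope, the symbols T and F resolve to the column values.
with_cols <- eval(quote(c(T, F)), cols)
with_cols  # 1 2

# The constants TRUE and FALSE are not looked up, so they cannot be shadowed.
constants <- eval(quote(c(TRUE, FALSE)), list(`TRUE` = 1, `FALSE` = 2))
constants  # TRUE FALSE
```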

Member

@Anirban166 Anirban166 left a comment

I'm all good with adding this FAQ entry for that purpose, and I think it would be helpful for those who stick to T/F. I just personally don't see much point in their existence in R, as sticking with TRUE or FALSE when needed sounds reasonable and good enough (in terms of typing convenience) to me. (In other words, people shouldn't generally use them even without us telling them I feel)

vignettes/datatable-faq.Rmd Outdated Show resolved Hide resolved
@MichaelChirico
Member

I'm all good with adding this FAQ entry, and I think it would be helpful for those who stick to T/F. I just personally don't see much point in their existence in R, as sticking with TRUE or FALSE when needed sounds reasonable and good enough (in terms of typing convenience) to me. (in other words, people shouldn't generally use them without us telling them I feel)

Agreed, but it's definitely a gotcha for beginners. Heaven knows why but tons of newbies (including myself!) start out writing R with T and F everywhere, and it's all over CRAN packages too-- we're stuck with it :(

Comment on lines +499 to +504
#> 1: a 1
#> 2: a 3
#> 3: a 6
#> 4: b 1
#> 5: b 3
#> 6: b 6
Member

please delete

When you use `TRUE` and `FALSE` in `DT[, .SD, .SDcols = c(TRUE, TRUE, FALSE)]`, `data.table` correctly identifies them as logical constants. This selects the first two columns (`x` and `y`) and excludes the third column (`v`).

2. **Using `T`/`F` leads to unexpected behavior:**
When you use `T` and `F` in `DT[, .SD, .SDcols = c(T, T, F)]`, `data.table` treats `T` and `F` as variable names rather than logical constants. Since these variables are not defined within the `data.table` scope, it results in incorrect behavior.
Member

this does not make sense to me, because the variable names are x, y, v, not T, F

Development

Successfully merging this pull request may close these issues.

TRUE/FALSE and abbreviated variants T/F have different behavior when used in j
5 participants