Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer as.Date() coercion to R level #6107

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Apr 29, 2024

Closes #6105

As seen in the tests, this might be a breaking change if any downstreams depend on the output being specifically Date (and not IDate). I am not sure it's possible -- inherits(., "Date") is still TRUE, and relevant methods should just back up to Date methods if not available for IDate. I lean towards just going ahead with this change unless revdeps finds anything concerning, WDYT @jangorecki?

cc also @HughParsonage who may have some extra insights.


One alternative I looked into that won't work is adding "Date" as an alias for "IDate" in the type hierarchy:

data.table/src/freadR.c

Lines 27 to 29 in d19bfef

static int typeSxp[NUT] = {NILSXP, LGLSXP, LGLSXP, LGLSXP, LGLSXP, LGLSXP, INTSXP, REALSXP, REALSXP, REALSXP, REALSXP, INTSXP, REALSXP, STRSXP, REALSXP, STRSXP };
static char typeRName[NUT][10]={"NULL", "logical", "logical", "logical", "logical", "logical", "integer", "integer64", "double", "double", "double", "IDate", "POSIXct", "character", "numeric", "CLASS" };
static int typeEnum[NUT] = {CT_DROP, CT_EMPTY, CT_BOOL8_N, CT_BOOL8_U, CT_BOOL8_T, CT_BOOL8_L, CT_INT32, CT_INT64, CT_FLOAT64, CT_FLOAT64_HEX, CT_FLOAT64_EXT, CT_ISO8601_DATE, CT_ISO8601_TIME, CT_STRING, CT_FLOAT64, CT_STRING};

The problem is that there are files where our rudimentary date parser doesn't detect the date, but as.Date() does, meaning fread.c returns a string --> the hierarchy is broken as there's an attempt to cast string->int32. See in particular this test:

https://github.com/Rdatatable/data.table/blob/d19bfef7026f25bb2d36c17879187d09bcb2b2c3/inst/tests/tests.Rraw#L11043

Copy link
Member Author

MichaelChirico commented Apr 29, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @MichaelChirico and the rest of your teammates on Graphite Graphite

@MichaelChirico MichaelChirico changed the title C-level changes needed Defer as.Date() coercion to R level Apr 29, 2024
@MichaelChirico

This comment was marked as outdated.

Copy link

github-actions bot commented May 1, 2024

Comparison Plot

Generated via commit 5fc89d7

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 11 minutes and 59 seconds

Time taken to run atime::atime_pkg on the tests: 4 minutes and 11 seconds

@jangorecki
Copy link
Member

Yes, let's see revdeps

@MichaelChirico MichaelChirico marked this pull request as ready for review May 3, 2024 19:21
.ci/atime/tests.R Outdated Show resolved Hide resolved
.ci/atime/tests.R Outdated Show resolved Hide resolved
setup = {
DT = data.table(date=.Date(sample(20000, N, replace=TRUE)))
tmp_csv = tempfile()
fwrite(DT, tmp_csv)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jangorecki I think we should run fread(DT) once here in the setup because of cacheing, right? Or do we need to run in a subprocess? Here the benchmark is really about what happens after the fread.c code is run, only care about differences emerging (1) in freadR.c and/or (2) fread.R post-processing. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(regardless, we see substantial improvement in all 3 cases already)

src/freadR.c Outdated
@@ -335,7 +335,7 @@ bool userOverride(int8_t *type, lenOff *colNames, const char *anchor, const int
type[i]=CT_STRING; // e.g. CT_ISO8601_DATE changed to character here so that as.POSIXct treats the date-only as local time in tests 1743.122 and 2150.11
SET_STRING_ELT(colClassesAs, i, tt);
}
} else {
} else if (type[i] != CT_ISO8601_DATE || tt != char_Date) {
type[i] = typeEnum[w-1]; // freadMain checks bump up only not down
if (w==NUT) SET_STRING_ELT(colClassesAs, i, tt);
}
Copy link
Member

@HughParsonage HughParsonage May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the consequence of a dangling else here? (As in, if the new condition evaluates to false, lines 339-340 would previously be evaluated.) I think a comment for this line would be useful here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the call-out, this is something I wasn't understanding well myself.

The key is we just skip updating type[i] for that case. We might consider rewriting this then as:

type[i] = (type[i] == CT_ISO8601_DATE && tt == char_Date) ? type[i] : typeEnum[w-1];

The "trouble" comes in the next line, because w==NUT in the relevant case here (i.e., "Date" is not a known class for C-side coercion).

So with the simple fix, we're back to R-side as.Date() coercion.

As noted, we might prefer that (retaining fully back-compatible Date output columns), but it'll require a corresponding change in the colClasses=list() branch.

So it's down again to whether the result of this PR should be "cols parsed as IDate and requested as Date are returned as IDate, fully avoiding any coercion" or "we now skip the middle step in IDate->char->Date coercion; for maximum efficiency, request IDate instead of Date".

Copy link
Member

@Anirban166 Anirban166 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance tests look great! (but please add the comments)

fwrite(DT, tmp_csv)
},
expr = data.table::fread(tmp_csv, colClasses=list(Date='date')),
Slow = "d19bfef7026f25bb2d36c17879187d09bcb2b2c3",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelChirico Can you please add a comment where the SHAs came from? (reviewing with Toby)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slow I pulled from master around when I filed this PR. Fast is the first commit of this PR. Is that worth documenting? Is there some better approach I should be aware of?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable enough to me, but I think @tdhock prefers to have comments mentioning that information beside those lines for consistency with how the other tests are documented or for others to know quickly why those commit SHAs are being used

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make it easy to double check where the commit IDs come from, it is best if they can be referenced relative to a PR, like the existing comments, "last commit in PR#xyz (URL)" etc

MichaelChirico added a commit that referenced this pull request Jun 1, 2024
Towards #6105. This PR is a simple refactor to reduce code nesting / improve readability. It makes the actual implementation of #6105, #6107, much simpler.
Base automatically changed from fread-date-idate to master June 1, 2024 06:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants