Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread blank.lines.skip gains new value "none" #6158

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from
Draft

Conversation

joshhwuu
Copy link
Member

@joshhwuu joshhwuu commented May 29, 2024

WIP towards #5611

Closes #5611

Currently, fread() starts reading at the first non-empty line of the file, regardless of blank.lines.skip, this aims to add a new value for blank.lines.skip, allowing users to skip "none" of the blank lines (read all lines of the file).

old:

file <- tempfile()
writeLines(c("", "", NA), file)
data.table::fread(file, blank.lines.skip = FALSE)
#>        V1
#>    <lgcl>
#> 1:     NA

new:

file <- tempfile()
writeLines(c("", "", NA), file)
data.table::fread(file, blank.lines.skip = "none")
#>        V1
#>    <lgcl>
#> 1:     NA
#> 2:     NA
#> 3:     NA

This does mean that users should be able to read in fully white/empty files if blank.lines.skip == "none", at least that behavior made sense to me.

Marking this as WIP as there are two lines I'm unsure of:

ASSERT(lastncol>0, "first non-empty row should always be at least one field; %c %d", sep, quoteRule); // # nocov

if (!fill && tt!=ncol) STOP(_("Internal error: first line has field count %d but expecting %d"), tt, ncol); // # nocov

these two assertions don't work with the changes for some reason I'm not so sure about, so I got around them by only having the assertions when blank.lines.skip != "none", but this feels a bit hacky and I'm not sure what kind of error these two assertions are supposed to check for. If someone is able to help me out with these that would be greatly appreciated!

TODO:

  • More tests
  • Figure out assertion issues
  • Documentation + cleanup

@joshhwuu
Copy link
Member Author

@ben-schwen Could you possibly take a look at this when you have time? Not sure if this was what you had in mind but it made sense to me, let me know if you have any suggestions or if you can provide some insight on the issue I have rn, thanks!

Copy link

github-actions bot commented May 29, 2024

Comparison Plot

Generated via commit 6c10807

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 12 minutes and 36 seconds

Time taken to run atime::atime_pkg on the tests: 3 minutes and 18 seconds

@Anirban166
Copy link
Member

ASSERT(lastncol>0, "first non-empty row should always be at least one field; %c %d", sep, quoteRule); // # nocov

if (!fill && tt!=ncol) STOP(_("Internal error: first line has field count %d but expecting %d"), tt, ncol); // # nocov

these two assertions don't work with the changes for some reason I'm not so sure about, so I got around them by only having the assertions when blank.lines.skip != "none", but this feels a bit hacky and I'm not sure what kind of error these two assertions are supposed to check for. If someone is able to help me out with these that would be greatly appreciated!

Thoughts regarding the first assertion - Since fread is not skipping empty lines and we are dealing with blank data for the first two rows, it might be that it is somehow considering the first blank line as a valid row (bug?), in which case the number of fields in it are zero (causing the first assertion to fail)

Btw, the variable names seem to be a bit confusing with regards to the error message (I'd expect lastncol to be something like the number of columns in the last non-empty row that has been read for example)

@Anirban166
Copy link
Member

Also, is the option blank.lines.skip primarily for performance improvement? (since irrespective of it being TRUE or FALSE I get the same output for the 'old' code example)

@joshhwuu
Copy link
Member Author

Thoughts regarding the first assertion - Since fread is not skipping empty lines and we are dealing with blank data for the first two rows, it might be that it is somehow considering the first blank line as a valid row (bug?), in which case the number of fields in it are zero (causing the first assertion to fail)

This was my theory too

Also, is the option blank.lines.skip primarily for performance improvement? (since irrespective of it being TRUE or FALSE I get the same output for the 'old' code example)

Not sure what the intended purpose is but I believe blank.lines.skip doesnt count NA as blank lines since its an entry, and with the old behavior (skipping all leading empty rows by default) either true or false produces the same result, try something like:

file <- tempfile()
writeLines(c("", "", NA, "", "a"), file)

@joshhwuu
Copy link
Member Author

joshhwuu commented Jun 4, 2024

Test 2264.1 produced 1 errors but expected 0
 Expected: 
 Observed: Internal error: first line has field count 0 but expecting 1
 Test 2264.2 produced 1 errors but expected 0
 Expected: 
 Observed: Internal error: first line has field count 0 but expecting 1
 Test 2264.3 produced 1 errors but expected 0
 Expected: 
 Observed: Internal error: first line has field count 0 but expecting 1

Looks like both assertions are very similar, failing because they expect the first line to have at least one field. I wonder what part of fread logic requires the first row to be non-empty. Perhaps for column writing?

@joshhwuu
Copy link
Member Author

joshhwuu commented Jun 4, 2024

Also, is the option blank.lines.skip primarily for performance improvement? (since irrespective of it being TRUE or FALSE I get the same output for the 'old' code example)

@Anirban166, looks like we skip blank lines whenever ncol > 1, not sure why but I'm getting the similar output for both true and false, except when blank.lines.skip=TRUE we keep the column names.

f = tempfile()
input = "a,b\n\n\n1,3\n2,5\n4,6"
writeLines(input, f)
fread(f, blank.lines.skip=T)
#    a b
# 1: 1 3
# 2: 2 5
# 3: 4 6

fread(f, blank.lines.skip=F)
#    V1 V2
# 1:  1  3
# 2:  2  5
# 3:  4  6

input = "a\n\n\n1\n2\n4"
writeLines(input, f)
print(fread(f, blank.lines.skip=F))
#     a
# 1: NA
# 2: NA
# 3:  1
# 4:  2
# 5:  4

Not sure if this is intended. If this is the case, that means the changes are performing as they should, as in not skipping beginning lines when we have 1 column, and same output as blank.lines.skip=FALSE whenever we have more than 1 column.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

blank.lines.skip won't work when the top lines is empty
2 participants