Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blank.lines.skip won't work when the top lines is empty #5611

Open
Yunuuuu opened this issue Mar 10, 2023 · 12 comments · May be fixed by #6158
Open

blank.lines.skip won't work when the top lines is empty #5611

Yunuuuu opened this issue Mar 10, 2023 · 12 comments · May be fixed by #6158

Comments

@Yunuuuu
Copy link

Yunuuuu commented Mar 10, 2023

I want to keep the total line number. Following code should return 3 rows, but it always return 1 row

file2 <- tempfile()
writeLines(c("", "", NA), file2)
data.table::fread(file2, blank.lines.skip = FALSE)

image

@Yunuuuu
Copy link
Author

Yunuuuu commented Nov 3, 2023

Add a reprex, and use readLines() to see the difference

file <- tempfile()
writeLines(c("", "", NA), file)
readLines(file)
#> [1] ""   ""   "NA"
data.table::fread(file, blank.lines.skip = FALSE)
#>        V1
#>    <lgcl>
#> 1:     NA
sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libmkl_rt.so;  LAPACK version 3.8.0
#>
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Asia/Shanghai
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#>  [1] digest_0.6.33     styler_1.10.1     fastmap_1.1.1     xfun_0.39
#>  [5] magrittr_2.0.3    glue_1.6.2        R.utils_2.12.2    knitr_1.43
#>  [9] htmltools_0.5.5   rmarkdown_2.23    lifecycle_1.0.3   cli_3.6.1
#> [13] R.methodsS3_1.8.2 vctrs_0.6.3       reprex_2.0.2      data.table_1.14.9
#> [17] withr_2.5.0       compiler_4.3.1    R.oo_1.25.0       R.cache_0.16.0
#> [21] purrr_1.0.1       tools_4.3.1       evaluate_0.21     yaml_2.3.7
#> [25] rlang_1.1.1       fs_1.6.3

Created on 2023-11-03 with reprex v2.0.2

@ben-schwen
Copy link
Member

We should probably update the documentation that blank lines at the beginning of a file are always skipped until the first non-empty row is encountered. This is also mentioned by the verbose output (shortened it to highlight the important part).

library(data.table)
file2 <- tempfile()
writeLines(c("", "", "a"), file2)
fread(file2, blank.lines.skip = FALSE, verbose=TRUE)
#> [05] Skipping initial rows if needed
#>   Positioned on line 3 starting: <<a>>

@Yunuuuu
Copy link
Author

Yunuuuu commented Dec 31, 2023

It’s not the document, we want to completely turn off the skipping of blank lines

@ben-schwen
Copy link
Member

I get that but changing the behavior of blank.lines.skip = FALSE would probably break a lot of existing code so I doubt we will do that.

Possible changes I could imagine would be to introduce another argument or let blank.lines.skip accept other values to indicate you do not want to skip blank lines at the beginning.

@joshhwuu
Copy link
Member

Do we see a purpose of an additional argument to turn off the initial skipping of lines? Or is a documentation update fine to close this issue, shouldn't be difficult to implement regardless.

@ben-schwen
Copy link
Member

fread has already so many arguments so I wouldn't want to introduce a new argument for that specific case. What do you think blank.lines.skip=NULL for indicating this behavior? Other free value of NA seems off, although I would ultimately tend towards blank.lines.skip="none" or something similar

@joshhwuu joshhwuu linked a pull request May 29, 2024 that will close this issue
3 tasks
@joshhwuu
Copy link
Member

joshhwuu commented Jun 4, 2024

@ben-schwen I did some initial prototyping of a change and here's what I found:

It seems that to make this behavior work in the most intuitive way, (blank.lines.skip="none" doesn't skip any blank lines, for any number of columns) there's likely a lot more work involved as we have to rework how fread() does skipping blank lines with more than 1 column. Not sure if this one use case is enough to warrant this.

Currently, blank.lines.skip=FALSE is ignored if we have more than 1 column of data (not sure if this is documented):
#6158 (comment)

If we make it so blank.lines.skip="none" only works with 1 column which is consistent with current blank.lines.skip=FALSE, then current changes should be most of the way there, barring some more testing. This would solve the MRE posted with the original issue, but I'm pretty sure OP intends to use this feature with data that has more than 1 column, which wouldn't be covered.

ATP, I feel a documentation change is the way to go. If we want to cover the MRE from the issue, the change is almost there. WDYT?

@ben-schwen
Copy link
Member

Tough question. Files with multiple columns and complete blank lines are clearly misspecified, so it's hard to guess what the intended behavior should be.

Does it play nice with complete blank lines, multiple columns and fill=TRUE?

@joshhwuu
Copy link
Member

joshhwuu commented Jun 4, 2024

f = tempfile()
input = c("a,b", ",", ",")
writeLines(input, f)
fread(f, blank.lines.skip=FALSE, fill=TRUE)
#     a  b
# 1: NA NA
# 2: NA NA

Seems to work, I think fill is what we needed.

I could initialize fill=TRUE whenever blank.lines.skip="none"?

Edit:
The above does work but only because column headers are specified. If the first line is empty, fread doesn't have a good way of defining columns, which is why the default behavior is to skip blanks until we find a non-empty line so we can define columns. If I bypass the skipping of the first blanks, fread just assumes the file to have one column, see:

f = tempfile()
input = "\n\n\n\n1,3\n\n2,4\n\n"
writeLines(input, f)
fread(f, blank.lines.skip="none", fill=TRUE)
#     V1 
#  1: NA                              
#  2: NA                              
# ---                                 
#  8: NA                              
#  9: NA 

@joshhwuu
Copy link
Member

joshhwuu commented Jul 2, 2024

Following up on this issue -- I figured that skipping the initial blank lines isn't quite so trivial because AFAIU fread uses the first non-empty line to determine the columns of the result. Do we think this is still worth pursuing to handle this case?

@MichaelChirico
Copy link
Member

I want to keep the total line number.

Hmm, it seems fread() is not exactly suitable for this. For example, we will attempt to find free-text fields that include newlines but are otherwise properly quoted & can thus be understood as a "rectangular" table.

If we simply want to count the number of newlines, this SO Q&A look pretty good to me (and adjustments to respect \r-delimited lines are not too hard):

https://stackoverflow.com/a/23456450/3576984

As to desired fread() behavior... still not sure... what's the interaction with header=FALSE for the examples here?

@joshhwuu
Copy link
Member

joshhwuu commented Jul 3, 2024

what's the interaction with header=FALSE for the examples here?

f = tempfile()
input = c("a,b", ",", ",")
writeLines(input, f)
fread(f, blank.lines.skip=FALSE, fill=TRUE, header=FALSE)
       V1     V2
   <char> <char>
1:      a      b
2:              
3:              

input = "\n\n\n\n1,3\n\n2,4\n\n"
writeLines(input, f)
fread(f, blank.lines.skip="none", fill=TRUE, header=FALSE)
      V1
   <num>
1:    NA
2:    NA
3:    NA
4:    NA
5:   1.3
6:    NA
7:   2.4
8:    NA
9:    NA

fread(f, blank.lines.skip='none')
      V1    V2
   <int> <int>
1:     2     4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants