Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read space delimited file where missing values are blank "". #3658

Open
zx8754 opened this issue Jun 25, 2019 · 9 comments · May be fixed by #6815
Open

Read space delimited file where missing values are blank "". #3658

zx8754 opened this issue Jun 25, 2019 · 9 comments · May be fixed by #6815

Comments

@zx8754
Copy link

zx8754 commented Jun 25, 2019

Example input text file - fileTest.txt:

c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4

Adding a screenshot to show there are 3 spaces on row r2 between 0 and 3, i.e.: values are missing for c3 and c4.

image

Using R version 3.4.0 and data.table 1.10.4 (Session info 1), below works as expected:

> fread("fileTest.txt", strip.white = FALSE)
   c1 c2 c3 c4 c5 c6
1: r1  0  1  2  3  4
2: r2  0 NA NA  3  4
3: r3  0  1  2  3  4

But fails with R version 3.5.2 and data.table_1.12.2. (Session info 2).

> fread("fileTest.txt", strip.white = FALSE)
   c1 c2 c3 c4 c5 c6
1: r1  0  1  2  3  4
Warning message:
In fread("fileTest.txt", strip.white = FALSE) :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>

Other attempts, all failed:

fread("fileTest.txt")
fread("fileTest.txt", sep = " ")
fread("fileTest.txt", strip.white = FALSE)
fread("fileTest.txt", sep = " ", strip.white = FALSE, na.strings = c("NA", ""))
fread("fileTest.txt", sep = " ", strip.white = FALSE, quote = "")

Note

read.table("fileTest.txt", strip.white = FALSE, sep = " ", na.strings = "")

Session info 1

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /opt/gridware/apps/intel/compilers_and_libraries_2016.3.210/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.10.4

loaded via a namespace (and not attached):
[1] compiler_3.4.0

Session info 2

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.2

loaded via a namespace (and not attached):
[1] compiler_3.5.2 tools_3.5.2   
@amarron314
Copy link

amarron314 commented Sep 24, 2019

Is there any news on this / any workaround without resorting to the slow read.table? I am also trying to read a large space separated file in R and this problem is so frustrating (especially since your SO post describes how they had a solution then made it stop working). What makes it worse is that the file is gzipped, and even read.table seems to have trouble with the double-space containing line.

@mehrlander
Copy link

I just found this, with the same problem. It's interesting as I saw noted that two consecutive commas are correctly interpreted to imply a blank value between, but two consecutive spaces are not.

@dvg-p4
Copy link
Contributor

dvg-p4 commented Sep 16, 2022

This is still a problem in the latest version, 1.14.2

@privefl
Copy link

privefl commented Feb 1, 2024

Having the same issue with v1.14.8

@tdhock
Copy link
Member

tdhock commented Feb 1, 2024

confirming, this is still an issue with current R-devel and data.table-1.15.0.
Here is R code to reproduce

text <- "c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4"
read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
data.table::fread(text=text, strip.white=FALSE)

here are the results on my system:

> text <- "c1 c2 c3 c4 c5 c6
+ r1 0 1 2 3 4
+ r2 0   3 4
+ r3 0 1 2 3 4"
> read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
  V1 V2   V3   V4 V5 V6
1 c1 c2   c3   c4 c5 c6
2 r1  0    1    2  3  4
3 r2  0 <NA> <NA>  3  4
4 r3  0    1    2  3  4
> data.table::fread(text=text, strip.white=FALSE)
       c1    c2    c3    c4    c5    c6
   <char> <int> <int> <int> <int> <int>
1:     r1     0     1     2     3     4
Warning message:
In data.table::fread(text = text, strip.white = FALSE) :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>
> sessionInfo()
R Under development (unstable) (2024-01-23 r85822 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Phoenix
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.4.0    tools_4.4.0       data.table_1.15.0
> 

@MichaelChirico
Copy link
Member

confirming, this is still an issue with current R-devel and data.table-1.15.0. Here is R code to reproduce

text <- "c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4"
read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
data.table::fread(text=text, strip.white=FALSE)

here are the results on my system:

> text <- "c1 c2 c3 c4 c5 c6
+ r1 0 1 2 3 4
+ r2 0   3 4
+ r3 0 1 2 3 4"
> read.table(text=text, strip.white = FALSE, sep = " ", na.strings = "")
  V1 V2   V3   V4 V5 V6
1 c1 c2   c3   c4 c5 c6
2 r1  0    1    2  3  4
3 r2  0 <NA> <NA>  3  4
4 r3  0    1    2  3  4
> data.table::fread(text=text, strip.white=FALSE)
       c1    c2    c3    c4    c5    c6
   <char> <int> <int> <int> <int> <int>
1:     r1     0     1     2     3     4
Warning message:
In data.table::fread(text = text, strip.white = FALSE) :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>
> sessionInfo()
R Under development (unstable) (2024-01-23 r85822 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: America/Phoenix
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.4.0    tools_4.4.0       data.table_1.15.0
> 

what about na.strings=" "?

@dvg-p4
Copy link
Contributor

dvg-p4 commented Feb 1, 2024

@MichaelChirico

what about na.strings=" "?

That doesn't correspond to the desired result here. Spaces should be delimiters, not NAs--the lack of a value between spaces should be interpreted as an NA.

And in any case, trying it regardless just yields different errors:

> data.table::fread(text=text, strip.white=FALSE, na.strings = " ")
Error in data.table::fread(text = text, strip.white = FALSE, na.strings = " ") : 
  na.strings[1]==" " consists only of whitespace, ignoring. But strip.white=FALSE. Use strip.white=TRUE (default) together with na.strings="" to turn any number of spaces in string columns into <NA>
> data.table::fread(text=text, strip.white=TRUE, na.strings = " ")
   c1 c2 c3 c4 c5 c6
1: r1  0  1  2  3  4
Warning messages:
1: In data.table::fread(text = text, strip.white = TRUE, na.strings = " ") :
  na.strings[1]==" " consists only of whitespace, ignoring. Since strip.white=TRUE (default), use na.strings="" to specify that any number of spaces in a string column should be read as <NA>.
2: In data.table::fread(text = text, strip.white = TRUE, na.strings = " ") :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>

@dvg-p4
Copy link
Contributor

dvg-p4 commented Feb 1, 2024

Oh I see, you probably meant na.strings = "" which was missing from tdhock's example. Still doesn't work:

> packageVersion("data.table")
[1] ‘1.15.0> text <- "c1 c2 c3 c4 c5 c6
+ r1 0 1 2 3 4
+ r2 0   3 4
+ r3 0 1 2 3 4"
> data.table::fread(text=text, strip.white=FALSE, na.strings = "")
       c1    c2    c3    c4    c5    c6
   <char> <int> <int> <int> <int> <int>
1:     r1     0     1     2     3     4
Warning message:
In data.table::fread(text = text, strip.white = FALSE, na.strings = "") :
  Stopped early on line 3. Expected 6 fields but found 4. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<r2 0   3 4>>

Pasteable:

text <- "c1 c2 c3 c4 c5 c6
r1 0 1 2 3 4
r2 0   3 4
r3 0 1 2 3 4"

data.table::fread(text=text, strip.white=FALSE, na.strings = "")

@dvg-p4
Copy link
Contributor

dvg-p4 commented Feb 12, 2025

This issue was introduced in data.table 1.11.0 specifically.

@dvg-p4 dvg-p4 linked a pull request Feb 12, 2025 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants