Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve handling of multiline strings #1491

Open
rmgpanw opened this issue Apr 17, 2023 · 2 comments
Open

Improve handling of multiline strings #1491

rmgpanw opened this issue Apr 17, 2023 · 2 comments
Labels
feature a feature request or enhancement read 📖

Comments

@rmgpanw
Copy link

rmgpanw commented Apr 17, 2023

Hi, I was recently reading a table into R using the readr package but discovered that several rows were missing as some cells included a single double quotation mark.

I managed to resolve this by setting quote = "", however I wonder if the read_tsv()/related functions could be updated to at least raise a warning if improper quoting is discovered in the data? The data.table package both raises a warning and deals with this automatically, as per the following reprex.

Many thanks for considering!

library(tibble)
library(readr)
library(data.table)

# create example file
df <- tribble(
  ~code, ~description,
  "1", '"Single quotes',
  "2", "No quotes",
  "3", 'Single quotes"',
  "4", '"Pair of quotes"',
  "5", "No quotes"
)

file_path <- tempfile(pattern = "df_with_quotes", fileext = ".tsv")

write_tsv(df,
  file_path,
  quote = "none",
  escape = "none"
)

# `readr::read_tsv()` reads inappropriately without raising an error with
# default settings
read_tsv(file_path,
  quote = "\""
)
#> Rows: 3 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): description
#> dbl (1): code
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 3 × 2
#>    code description                                    
#>   <dbl> <chr>                                          
#> 1     1 "Single quotes\n2\tNo quotes\n3\tSingle quotes"
#> 2     4 "Pair of quotes"                               
#> 3     5 "No quotes"

# To read correctly use the `quote` argument
read_tsv(file_path,
  quote = ""
)
#> Rows: 5 Columns: 2
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): description
#> dbl (1): code
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5 × 2
#>    code description         
#>   <dbl> <chr>               
#> 1     1 "\"Single quotes"   
#> 2     2 "No quotes"         
#> 3     3 "Single quotes\""   
#> 4     4 "\"Pair of quotes\""
#> 5     5 "No quotes"

# `data.table::fread()` raises a warning but reads the table correctly with
# default settings
fread(file_path,
  quote = "\""
)
#> Warning in fread(file_path, quote = "\""): Found and resolved improper quoting
#> in first 100 rows. If the fields are not quoted (e.g. field separator does not
#> appear within any field), try quote="" to avoid this warning.
#>    code    description
#> 1:    1 "Single quotes
#> 2:    2      No quotes
#> 3:    3 Single quotes"
#> 4:    4 Pair of quotes
#> 5:    5      No quotes

# setting `quote=""` produces (almost) the same result, now without a warning
fread(file_path,
  quote = ""
)
#>    code      description
#> 1:    1   "Single quotes
#> 2:    2        No quotes
#> 3:    3   Single quotes"
#> 4:    4 "Pair of quotes"
#> 5:    5        No quotes

sessionInfo()
#> R version 4.2.0 (2022-04-22)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur/Monterey 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] data.table_1.14.6 readr_2.1.4       tibble_3.1.8     
#> 
#> loaded via a namespace (and not attached):
#>  [1] pillar_1.8.1      compiler_4.2.0    R.methodsS3_1.8.2 R.utils_2.12.2   
#>  [5] tools_4.2.0       digest_0.6.31     bit_4.0.5         evaluate_0.20    
#>  [9] lifecycle_1.0.3   R.cache_0.16.0    pkgconfig_2.0.3   rlang_1.0.6      
#> [13] reprex_2.0.2      cli_3.6.0         rstudioapi_0.14   parallel_4.2.0   
#> [17] yaml_2.3.7        xfun_0.37         fastmap_1.1.0     withr_2.5.0      
#> [21] styler_1.9.1      knitr_1.42        fs_1.6.1          vctrs_0.5.2      
#> [25] hms_1.1.2         tidyselect_1.2.0  bit64_4.0.5       glue_1.6.2       
#> [29] R6_2.5.1          fansi_1.0.4       vroom_1.6.1       rmarkdown_2.20   
#> [33] purrr_1.0.1       tzdb_0.3.0        magrittr_2.0.3    htmltools_0.5.4  
#> [37] ellipsis_0.3.2    utf8_1.2.3        crayon_1.5.2      R.oo_1.25.0
@daranzolin
Copy link

I experienced the same issue with read_csv(). A single double quote preceding one of the values omitted the remaining records. Only the warning from fread() saved me:

image

@hadley
Copy link
Member

hadley commented Jul 31, 2023

Here's a somewhat more minimal reprex:

library(readr)

lines <- 'code,description
1,"x
2,-
3,y"'
path <- tempfile()
writeLines(lines, path)

# Allows quoted string to span multiple lines
read_csv(path, col_types = list())
#> # A tibble: 1 × 2
#>    code description  
#>   <dbl> <chr>        
#> 1     1 "x\n2,-\n3,y"

# OK: explicit quote works
read_csv(path, quote = "", col_types = list())
#> # A tibble: 3 × 2
#>    code description
#>   <dbl> <chr>      
#> 1     1 "\"x"      
#> 2     2 "-"        
#> 3     3 "y\""

# warns & treats lines as quoted
data.table::fread(path, quote = "\"")
#> Warning in data.table::fread(path, quote = "\""): Found and resolved improper
#> quoting in first 100 rows. If the fields are not quoted (e.g. field separator
#> does not appear within any field), try quote="" to avoid this warning.
#>    code description
#> 1:    1          "x
#> 2:    2           -
#> 3:    3          y"

Created on 2023-07-31 with reprex v2.0.2

@hadley hadley added feature a feature request or enhancement read 📖 labels Jul 31, 2023
@hadley hadley changed the title Improve how read_tsv() handles improper quoting Improve handling of multiline strings Jul 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement read 📖
Projects
None yet
Development

No branches or pull requests

3 participants