read_delim could use a fill argument #1416

Closed
micwij opened this issue Aug 10, 2022 · 6 comments
Labels
reprex needs a minimal reproducible example


micwij commented Aug 10, 2022

I know the readr package is specifically designed for reading rectangular data. However, I think it would be very useful to have a fill argument (at least in read_delim) that fills in the missing cells when rows have unequal lengths. Currently it appears that the columns are created based on the first row that is read, even if more col_names are supplied than that row has fields, and even if later rows have more fields according to the supplied delimiter. Both base::read.delim and data.table::fread have this argument, so it might be a worthwhile addition to read_delim for reading files that are not perfectly rectangular. If I am not mistaken, older versions of readr had some functionality like that (the script I am working on right now used to work, but parses differently now). I hope the issue is clear. If not, let me know and I can try to provide an example.

Thanks a lot for the consideration!
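
For readers who want to see the behaviour being requested, here is a minimal sketch using base R's `read.delim` with `fill = TRUE`. The three-row sample data and the `col1`..`col5` names are made up for illustration and are not taken from the reporter's file.

```r
# Hypothetical tab-separated sample: the third row has one more field
# than the first two.
ragged <- "a\tb\tc\td\ne\tf\tg\th\ni\tj\tk\tl\tm"

filled <- read.delim(
  text = ragged,
  header = FALSE,
  fill = TRUE,                     # short rows get blank fields added
  col.names = paste0("col", 1:5),  # fix the column count up front
  stringsAsFactors = FALSE
)

print(filled)
```

Because `col.names` is longer than the first row, `read.delim` sizes the result from the names rather than from the data, which is the behaviour the request asks `read_delim` to mirror.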

@sbearrows sbearrows added the reprex needs a minimal reproducible example label Aug 25, 2022
@sbearrows
Contributor

Thanks for opening this issue! It'd be great if you could provide a reproducible example to help us understand.

Author

micwij commented Sep 13, 2022

Hi,

Sorry for the delay. I attached a small file and some code below (sorry, I don't know how to work with markdown or reprex, but I hope this is okay as well). As mentioned, this file is not rectangular (row 3 has an additional field separator). I would find it plausible if read_delim created an additional column that is filled with NAs for rows where there is no data. Instead, rows that have more fields accumulate the extra data in the last generated column. read_delim also reports that there are parsing issues. I realize that encountering this kind of data may be rare, but it did happen to me, and I saw that other packages have a fill argument that helps with this case, so I thought I would mention it.

library(tidyverse)
tib <- read_delim("reprex_readr.txt", delim = "\t",
                  col_names = c("col1", "col2", "col3", "col4", "col5 is missing"))

problems(tib)

# A tibble: 1 x 5
#     row   col expected  actual    file
# 1     3     5 4 columns 5 columns C:/Users/wijesingha1055/Nextcloud/Documents/tRial a~

reprex_ex.txt

@sbearrows
Contributor

You are correct that readr previously behaved differently. The readr package now defaults to using vroom, which has marked performance advantages when reading rectangular data. As you noted, readr/vroom uses the first row of data to generate the scaffolding for the remaining data. Because this happens early on, adding a fill feature would require a major overhaul. The consensus seems to be that this is a case for using other packages when files are not quite rectangular. Here is a related issue that you might peruse in case other solutions reveal themselves: #762
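
One way to decide whether a file needs a different reader is to check for ragged rows first. The sketch below uses `utils::count.fields` from base R. The temporary file and its contents are hypothetical stand-ins for a real input file.

```r
# Write a small hypothetical tab-separated file whose third row has an
# extra field.
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc\td", "e\tf\tg\th", "i\tj\tk\tl\tm"), path)

# Number of fields on each line, given the delimiter.
n_fields <- count.fields(path, sep = "\t")

# The file is ragged when the lines do not all have the same field count.
is_ragged <- length(unique(n_fields)) > 1

print(n_fields)
print(is_ragged)
```

If `is_ragged` is `TRUE`, the first row is not a reliable template for the rest of the file, and a reader that supports filling may be the safer choice.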

Member

jennybc commented Sep 21, 2022

> If I am not mistaken older versions of readr had some functionality like that (based on the script I am working on right now, which worked earlier, but parses differently now)

One more thing I'll add, which is relevant for folks with a script that used to work (with a previous version of readr) but now doesn't (with the current version), is that you can opt in to the older behaviour.

This can be handy for keeping legacy code alive that you don't really want to touch or refactor.

You can use readr::with_edition(1) or readr::local_edition(1) for this and you can search readr's GitHub issues to find examples of usage.

https://readr.tidyverse.org/reference/with_edition.html
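
As a rough sketch of how the two helpers are typically called (the temporary file, its contents, the column names, and the `legacy_read` wrapper are hypothetical, and the resulting column layout is not shown because it depends on the installed readr version):

```r
library(readr)

# Hypothetical ragged input: the third row has an extra field.
path <- tempfile(fileext = ".txt")
writeLines(c("a\tb\tc\td", "e\tf\tg\th", "i\tj\tk\tl\tm"), path)

cols <- c("col1", "col2", "col3", "col4", "col5")

# Evaluate a single call under first-edition parsing.
tib_legacy <- with_edition(
  1,
  read_delim(path, delim = "\t", col_names = cols)
)

# Or switch editions for the remainder of a function body; the setting
# is restored when the function exits.
legacy_read <- function(path, cols) {
  local_edition(1)
  read_delim(path, delim = "\t", col_names = cols)
}

tib_legacy2 <- legacy_read(path, cols)
```

`with_edition()` keeps the change scoped to one expression, which is usually the least invasive option for legacy scripts.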

@ilikegitlab

The package could be so much better with a little attention to robustness instead of performance. Real-world data is often not perfect, and a single line should not prevent read_csv from being usable. I've often encountered software that writes out CSV files with unequal columns; why waste bytes on NA/whitespace?

Interestingly, the "consensus" seems to be (I asked at least 4 people) that it is a thoroughly bad idea to sacrifice usability just for the sake of performance. So... you're just plain wrong there (see how lame this sounds when reflected?)

Let's instead actually address the issue: when I specify the columns (number, names, or types), why does vroom not build the scaffold from my given col_names and types instead of trying to be "intelligent" about it? I bet it would even be faster, and it would put the power to make it work on imperfect data in the hands of the user (instead of basically forcing a switch to plain read.csv or data.table). The latter is faster, yet robust. And it surely shouldn't be the goal to make readr obsolete.

So why not, in vroom's create_columns, use the length of col_names instead of basing it on the data? I agree that a bit more thinking/design about how it plays out under all conditions may be required, but it is hardly rocket science, and I care far less about accidentally having some empty columns than about having no columns at all. This might also have advantages when combining multiple files that sometimes miss a few columns (I was actually looking for that a while ago, so I am now pondering whether I should fork vroom to make it work).
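
To make the idea concrete, here is a minimal base R sketch of sizing the result from `col_names` rather than from the first row. This is only an illustration of the proposal, not vroom's internals; `read_ragged` is a hypothetical helper, it returns every column as character, and it assumes that silently dropping fields beyond `length(col_names)` is acceptable.

```r
# Hypothetical helper: split each line on the delimiter and force every
# row to have exactly length(col_names) fields.
read_ragged <- function(lines, delim, col_names) {
  n <- length(col_names)
  fields <- strsplit(lines, delim, fixed = TRUE)
  padded <- lapply(fields, function(x) {
    length(x) <- n  # pads short rows with NA, truncates long rows
    x
  })
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- col_names
  out
}

lines <- c("a\tb\tc\td", "e\tf\tg\th", "i\tj\tk\tl\tm")
out <- read_ragged(lines, "\t", paste0("col", 1:5))
print(out)
```

A real implementation would also need to handle quoting, type guessing, and reporting of truncated rows, which this sketch leaves out.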


D3SL commented Nov 13, 2023

I have to agree with ilikegitlab. I have a CSV with some slightly ragged data. In all previous versions it was trivial to use col_names to create a placeholder column and clean the data once read. As of version 2.1.4 that no longer works. I can't even force readr to read the data into the columns I give it.

This is the second major RStudioverse package to make a profoundly breaking change with apparently no warnings and no announcements. It's deeply concerning to see the de facto monopoly of the R world going in this direction.
