Turn autoread() into a wrapper for rio::import() #1

Open
bokov opened this issue Sep 18, 2019 · 0 comments
Labels
blocked Waiting for something outside the project enhancement New feature or request

Comments

@bokov
Owner

bokov commented Sep 18, 2019

rio::import() does what I've been trying to do, but better. I can switch to it and add something it doesn't do: empirical format detection. File extensions are not a reliable indicator of format: some are ambiguous or missing altogether, and some users and apps create files with incorrect extensions. Even magic number snooping isn't foolproof. For example, ODS and XLSX have the same magic numbers (both are ZIP containers). The only sure way to determine whether a file has a particular format is to try reading it and see what happens, or as a Python programmer might say, "ask for forgiveness instead of for permission".

The wrapper I want to write (or that the rio team will write, if they like the idea and beat me to it) will do some input validation/adjustment and then try to run rio::import(), letting it infer the format from the file as it does by default. If that fails, the wrapper will iterate over a vector of supported formats (ideally sorted by how commonly used they are), calling rio::import() again with each format in turn until it either succeeds or all formats have been tried. Yes, this is a brute-force approach, but mis-specified read attempts tend to fail before they spend much time reading the file, so it doesn't take long to converge on the right format by trial and error. In use cases where you're mass-processing a collection of files, or building an analysis pipeline where you don't know the file format in advance, this tradeoff is worth it.

I might have the function use either message() or an attribute to record which format turned out to be correct, so the user will know for next time.
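
A minimal sketch of the try-then-iterate idea, not a finished implementation. The function name `autoread_sketch`, the `importer` argument, the `detected_format` attribute name, and the particular order of `formats` are all hypothetical choices made for illustration; the only rio feature assumed is that `rio::import()` accepts a `format` argument.

```r
# Hypothetical sketch: let the importer guess the format first, then fall
# back to trying each candidate format until one reads without error.
autoread_sketch <- function(file,
                            # Assumed ordering; not an official rio list.
                            formats = c("csv", "xlsx", "tsv", "rds", "rdata",
                                        "sav", "dta", "ods", "json"),
                            # Swappable so the logic can be tried without rio.
                            importer = rio::import,
                            ...) {
  # Basic input validation
  stopifnot(is.character(file), length(file) == 1, file.exists(file))

  # First attempt: the importer's own default format detection
  result <- tryCatch(importer(file, ...), error = function(e) NULL)
  if (!is.null(result)) return(result)

  # Fallback: brute force over the candidate formats
  for (fmt in formats) {
    result <- tryCatch(importer(file, format = fmt, ...),
                       error = function(e) NULL)
    if (!is.null(result)) {
      # Record the winning format both ways discussed above
      message("Read '", file, "' as format: ", fmt)
      attr(result, "detected_format") <- fmt
      return(result)
    }
  }
  stop("None of the candidate formats could read '", file, "'")
}
```

Note that this accepts the first format that does not raise an error, which is not necessarily the correct one: a delimited text file read with the wrong delimiter can "succeed" as a single-column table, so a real version would probably want some sanity check on the result.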

First, though, we need more standardization in the arguments. @leeper already does this for the which parameter, and someday I'd like to contribute similar mappings for other general concepts in reading tabular data, including skip, nrows, and na. Until then, we at least need to keep the underlying read functions from being passed arguments that make them error. I believe I have a pretty clean solution to this in my pull request. If things go well with that, I'll submit similar PRs for other common readers where this happens, and then I'll be ready to replace my autoread() with rio::import().
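
To illustrate what "not passing arguments that make them error" could look like, here is one possible approach, which is not necessarily what the pull request does: drop any named argument that the target function does not declare, unless the function accepts `...`. The helper name `call_with_supported_args` is hypothetical.

```r
# Hypothetical helper: call `fun`, silently dropping named arguments that
# it does not declare. Functions that accept `...` receive everything.
call_with_supported_args <- function(fun, ...) {
  args <- list(...)
  fmls <- names(formals(fun))  # NULL for primitive functions
  nms <- names(args)
  if (is.null(nms)) nms <- rep("", length(args))
  if (!is.null(fmls) && !("..." %in% fmls)) {
    # Keep positional (unnamed) arguments and declared named ones
    args <- args[!nzchar(nms) | nms %in% fmls]
  }
  do.call(fun, args)
}

# Usage: `nrows` is dropped because `reader` does not declare it.
reader <- function(x, skip = 0) x + skip
call_with_supported_args(reader, x = 1, skip = 2, nrows = 5)
```

This only checks exact argument names, so it does no partial matching and does not help when a reader accepts `...` and then forwards an unsupported argument to something deeper in the call chain.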

Note to self: one student has an .rda file as input, so once I resolve this ticket, I should notify them.

@bokov bokov added enhancement New feature or request blocked Waiting for something outside the project labels Sep 18, 2019