Introduce default_locale2()
#1445
Edit: This was/is a separate, albeit related, issue #1468.
@phish108 has brought to my attention that institutions in Switzerland, like the Federal Statistical Office, publish at least some CSV files with semicolon as delimiter and dot as decimal sign. Here is one example (metadata):

```r
library(readr)
ch_url <- "https://www.bfs.admin.ch/bfsstatic/dam/assets/21765619/master"
read_lines(ch_url, n_max = 5)
#> [1] "\"TIME_PERIOD\";\"GEO\";\"POP\";\"ERWP\";\"ERWL\";\"UNIT_MEA\";\"OBS_VALUE\";\"OBS_CONFIDENCE\";\"OBS_STATUS\""
#> [2] "\"2010/2012\";\"101\";\"Total\";\"0\";\"Total\";\"pers\";12374.03;5.620;\"A\""
#> [3] "\"2010/2012\";\"101\";\"Total\";\"1\";\"Total\";\"pers\";28394.11;3.718;\"A\""
#> [4] "\"2010/2012\";\"101\";\"Total\";\"Total\";\"Total\";\"pers\";40768.14;3.074;\"A\""
#> [5] "\"2010/2012\";\"102\";\"Total\";\"0\";\"Total\";\"pers\";7288.64;7.321;\"A\""
```

Another example (metadata):

```r
ch_url2 <- "https://www.web.statistik.zh.ch/ogd/data/KANTON_ZUERICH_375.csv"
read_lines(ch_url2, n_max = 5)
#> [1] "BFS_NR;GEBIET_NAME;THEMA_NAME;SET_NAME;SUBSET_NAME;INDIKATOR_ID;INDIKATOR_NAME;INDIKATOR_JAHR;INDIKATOR_VALUE;EINHEIT_KURZ;EINHEIT_LANG;"
#> [2] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1990;2.6;Mio.Fr.;Millionen Franken;"
#> [3] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1991;3.0;Mio.Fr.;Millionen Franken;"
#> [4] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1992;3.3;Mio.Fr.;Millionen Franken;"
#> [5] "1;Aeugst a.A.;Öffentliche Finanzen;Gemeindesteuern;Steuerkraft;375;Steuerkraft (arith. Mittel 3 Jahre) [Mio.Fr.];1993;3.6;Mio.Fr.;Millionen Franken;"
```

Apparently this spec is not used consistently. Here is an example from the Statistical Office where they use comma as delimiter (metadata):

```r
ch_url3 <- "https://dam-api.bfs.admin.ch/hub/api/dam/assets/24106318/master"
read_lines(ch_url3, n_max = 5)
#> [1] "\"REGION\",\"CANTON\",\"PERIOD\",\"VALUE\",\"STATUS\",\"OBS_COEF\""
#> [2] "\"Total\",\"Total\",\"2023-01\",\"2.19208880770041\",\"A\",\"A\""
#> [3] "\"1\",\"Total\",\"2023-01\",\"3.47967832561423\",\"A\",\"A\""
#> [4] "\"1\",\"22\",\"2023-01\",\"3.49266747468504\",\"A\",\"A\""
#> [5] "\"1\",\"23\",\"2023-01\",\"2.97652113777818\",\"A\",\"A\""
```

But be that as it may, why not support semicolon as delimiter and dot as decimal sign, with `read_csv2()` flexible re decimal separator and a `default_locale2()`?
Should fix at same time as #1468
I’ve recently run into the `read_csv2()` with dot as decimal separator issue that’s been mentioned here several times before.

Basically the problem is that `read_csv2()` cannot handle CSV files with semicolon as delimiter and the dot as decimal separator.

The reply so far has been (paraphrasing) “This is not what `read_csv2()` is intended for, because you don’t have a comma as decimal separator. Use `read_delim(delim = ";")` instead.”
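
For reference, the workaround from that reply can be sketched as follows. This is a minimal example on made-up inline data (`csv_text` is hypothetical), and it assumes readr 2.0 or later, where `I()` marks a string as literal data:

```r
library(readr)

# Hypothetical inline data: semicolon as delimiter, dot as decimal mark.
csv_text <- "id;value\n1;12374.03\n2;28394.11\n"

# read_delim() only sets the delimiter; the default locale keeps "." as
# the decimal mark, so the value column is parsed as a double.
df <- read_delim(I(csv_text), delim = ";", show_col_types = FALSE)

df$value
```

It works, but it means leaving `read_csv2()` altogether as soon as the decimal mark is not a comma.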
I think this reasoning and its current implementation are problematic.
First, overriding the `locale()` setting is quite smelly (sorry, Hadley!): see readr/R/read_delim.R, lines 333 to 337 in 41447ce.
Due to that, `read_csv2()` does not apply custom `decimal_mark` and `grouping_mark` locale settings. This affects not only the dot as decimal mark, as mentioned above: a custom `grouping_mark` is ignored by `read_csv2()` as well. Compare:
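
A minimal sketch of such a comparison, on made-up inline data rather than the original reprex. It assumes readr 2.0 or later (for `I()` and `show_col_types`) and that `read_csv2()` overrides the locale as described in this comment; no output is shown because it has not been verified here:

```r
library(readr)

# Hypothetical inline data: semicolon as delimiter, dot as decimal mark.
csv_text <- "id;value\n1;1.5\n2;2.25\n"

# An explicit locale asking for "." as decimal and "," as grouping mark.
loc <- locale(decimal_mark = ".", grouping_mark = ",")

# Same data, same locale, two readers.
via_csv2  <- read_csv2(I(csv_text), locale = loc, show_col_types = FALSE)
via_delim <- read_delim(I(csv_text), delim = ";", locale = loc,
                        show_col_types = FALSE)

# read_delim() honours the locale; compare with what read_csv2() returns.
via_delim$value
via_csv2$value
```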
In addition, the warning message is confusing (e.g. it is not clear that this is indeed an override) and easily overlooked among the other messages (can you spot it immediately above? It’s not obvious IMO, even with this minimal column spec).
Even worse, this warning is always shown by default. See the `read_csv2()` example.

This is because `default_locale()` implies `decimal_mark = "."`, which `read_csv2()` then has to override, throwing the warning. But this essentially means that the `locale = default_locale()` default in `read_csv2()` is useless, because to get rid of the warning, you have to define a different `locale()` anyway.

Conceptually, it would be more sound to treat `read_csv2()` the same as `read_csv()` and `read_tsv()`, i.e. set the delimiter, but grant flexibility on the decimal separator.
E.g. with `read_tsv()` you can easily choose the decimal separator via `locale`.

`read_csv2()` would need a different `locale`/decimal separator default of course (but would need that anyway, because the current default is useless): either a `default_locale2()`, or `locale(decimal_mark = ",")` or `locale(decimal_mark = ",", grouping_mark = ".")`.

Is that something you would consider? Happy to draft a PR.
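
To make the proposal concrete, here is a rough sketch, not an implementation. `default_locale2()` and `read_csv2_flexible()` are hypothetical names that do not exist in readr; the wrapper simply delegates to `read_delim()`. The inline data is made up, and readr 2.0 or later is assumed for `I()`:

```r
library(readr)

# read_tsv() sets the delimiter but leaves the decimal mark to `locale`.
tsv_text <- "id\tvalue\n1\t1,5\n2\t2,25\n"
tsv_df <- read_tsv(I(tsv_text), locale = locale(decimal_mark = ","),
                   show_col_types = FALSE)

# Hypothetical helper (not part of readr): a default locale with comma
# as decimal mark and dot as grouping mark.
default_locale2 <- function() {
  locale(decimal_mark = ",", grouping_mark = ".")
}

# Hypothetical wrapper (not part of readr): fixes the delimiter to ";"
# and respects whatever locale the caller passes.
read_csv2_flexible <- function(file, locale = default_locale2(), ...) {
  read_delim(file, delim = ";", locale = locale, ...)
}

# The default handles comma decimals ...
comma_df <- read_csv2_flexible(I("id;value\n1;1,5\n"),
                               show_col_types = FALSE)

# ... and a custom locale handles dot decimals, with no override.
dot_df <- read_csv2_flexible(I("id;value\n1;1.5\n"),
                             locale = locale(decimal_mark = "."),
                             show_col_types = FALSE)
```

With a default like this, the common comma case needs no extra arguments, and nothing has to be overridden when the caller passes a locale of their own.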
One last thing:
It’s not just “some European countries”. Half the world uses the comma as a decimal separator and while I don’t have any numbers on the semicolon as delimiter in CSVs, it is clear that it cannot be the comma in those countries. Maybe it’s just me, but frankly that line also sounds a tad dismissive to me - I know it’s not meant that way, but still.