write_csv creates an unexpected CSV file if the data frame has columns with mixed encoding attributes. #1412

Closed
ujtwr opened this issue Jul 18, 2022 · 3 comments
Labels: encoding; reprex (needs a minimal reproducible example)

ujtwr commented Jul 18, 2022

If a data frame has columns with mixed encoding attributes, write_csv generates an unexpected CSV file.

  1. Create a tibble with UTF-8 (Japanese) column names.
library(tidyverse)

d1 <- tibble(
  id = seq(1, 1000)
) %>% 
  mutate(
    性別 = sample(x = c("男性","女性"), size = 1000, replace = TRUE),
    第1回 = rnorm(1000),
    第2回 = rnorm(1000),
    第3回 = rnorm(1000),
    第4回 = rnorm(1000),
    第5回 = rnorm(1000),
    第6回 = rnorm(1000),
    第7回 = rnorm(1000)
  )
  2. Check the encoding attribute of the column names and of the column contents.
print(stringi::stri_enc_mark(colnames(d1)))
#> "ASCII"  "native" "native" "native" "native" "native" "native" "native" "native"

Encoding(colnames(d1))
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

## check the encoding attribute of the column contents that include Japanese characters
Encoding(d1$性別)
#>  [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" ...
  3. Use pivot_longer to turn the Japanese column names (which have the "native" encoding attribute) into row values.
d2 <- d1 %>% 
  pivot_longer(cols=!c(id, 性別), names_to = "実施回", values_to = "スコア")
  4. Check the encoding attribute of the column names and of the column contents again.
print(stringi::stri_enc_mark(colnames(d2)))
#> [1] "ASCII"  "native" "UTF-8"  "UTF-8" 

# second column created by mutate step
print(stringi::stri_enc_mark(d2$性別))
#>  [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"...

# third column converted by pivot_longer from column name
print(stringi::stri_enc_mark(d2$実施回))
#>  [1] "native" "native" "native" "native" "native" "native" "native" "native"...

The contents of the second column ("性別") have the "UTF-8" encoding attribute because they were created in the mutate step. On the other hand, the contents of the third column ("実施回") have the "native" encoding attribute because they were converted from column names that had the "native" encoding attribute.

So the data frame "d2" has one column with the "UTF-8" encoding attribute and another with the "native" encoding attribute. When I write "d2" to a CSV file with "write_csv", an unexpected CSV file is generated.
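
For readers less familiar with encoding attributes, here is a minimal base R sketch. It is not taken from the issue, and the example string is an assumption chosen for illustration. It shows that two strings can hold the same text while carrying different encoding marks and different bytes:

```r
# Illustrative only: the same text stored under two different encodings.
x_utf8   <- "caf\u00e9"                                   # a \u escape yields a UTF-8-marked string
x_latin1 <- iconv(x_utf8, from = "UTF-8", to = "latin1")  # same text, Latin-1 bytes

# The declared encoding marks differ.
marks <- c(utf8 = Encoding(x_utf8), latin1 = Encoding(x_latin1))
print(marks)

# The underlying bytes differ as well.
bytes_utf8   <- charToRaw(x_utf8)
bytes_latin1 <- charToRaw(x_latin1)
print(bytes_utf8)
print(bytes_latin1)

# R translates before comparing, so the two strings still compare equal.
same_text <- x_utf8 == x_latin1
print(same_text)
```

Because the strings look the same when printed, code that handles the raw bytes has to respect the mark on each string, which may be why a column with mixed marks is easy to overlook.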

  5. Write "d2" to CSV repeatedly.
write_csv(x = d2, file = "file01.csv")
write_csv(x = d2, file = "file02.csv")
write_csv(x = d2, file = "file03.csv")
write_csv(x = d2, file = "file04.csv")
write_csv(x = d2, file = "file05.csv")

I expect all the files to be the same, but they differ from each other.

  6. In a terminal, diff the files.
$ diff file01.csv file02.csv 
#> 16c16
#> < 3,女性,第7回,-0.6832698890819322
#> ---
#> > 3,女性,第1回,-0.6832698890819322
#> 147c147
#> < 21,女性,第6回,0.7099581154824097
#> ...

$ diff file03.csv file04.csv 
#> 28c28
#> < 4,男性,第5回,-0.19047686780971046
#> ---
#> > 4,男性,第6回,-0.19047686780971046
#> 646c646
#> < 93,女性,第1回,0.6614739457424788
#> ...
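
The same comparison can also be sketched from inside R by comparing checksums. The helper name `all_files_identical` is hypothetical, and the demo uses temporary files rather than the CSV files from this report:

```r
# Hypothetical helper: TRUE when every file has the same MD5 checksum.
all_files_identical <- function(paths) {
  checksums <- tools::md5sum(paths)
  if (anyNA(checksums)) stop("some files could not be read")
  length(unique(unname(checksums))) == 1L
}

# Self-contained demo with temporary files.
f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")
f3 <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,a"), f1)
writeLines(c("id,value", "1,a"), f2)
writeLines(c("id,value", "1,b"), f3)

same_result <- all_files_identical(c(f1, f2))      # identical contents
diff_result <- all_files_identical(c(f1, f2, f3))  # third file differs
print(same_result)
print(diff_result)
```

With the files from step 5, the call would be `all_files_identical(sprintf("file%02d.csv", 1:5))`.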
  7. As a workaround, re-encode the columns to "UTF-8".
d3 <- d2 %>% 
  # re-encode character columns only, so numeric columns keep their type
  mutate(across(where(is.character), ~stringi::stri_encode(.x, to = "UTF-8")))

write_csv(x = d3, file = "file01.csv")
write_csv(x = d3, file = "file02.csv")
write_csv(x = d3, file = "file03.csv")
write_csv(x = d3, file = "file04.csv")
write_csv(x = d3, file = "file05.csv")

All the files are identical.
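
A similar workaround can be sketched with base R only, using `enc2utf8()`. The helper name `to_utf8` and the demo data are assumptions for illustration. Note that `enc2utf8()` treats unmarked strings as being in the native encoding, so this only helps when that assumption holds:

```r
# Hypothetical helper: re-encode the names and character columns to UTF-8.
to_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], enc2utf8)  # numeric columns are left untouched
  names(df) <- enc2utf8(names(df))
  df
}

# Demo data with a Latin-1 column standing in for a non-UTF-8 column.
demo <- data.frame(
  id = 1:2,
  label = iconv(c("caf\u00e9", "na\u00efve"), from = "UTF-8", to = "latin1"),
  stringsAsFactors = FALSE
)

fixed <- to_utf8(demo)
print(Encoding(demo$label))   # marks before conversion
print(Encoding(fixed$label))  # marks after conversion
```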

The problem is that, at first glance, the file appears to have been generated correctly: "write_csv" shows no error or warning message.
Is it possible for write_csv to write the file correctly without converting to UTF-8 by hand, or at least to detect the error?

Or is this a problem in dplyr (mutate) or tidyr (pivot_longer)?
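
Until that is answered, one way to detect the situation before writing might look like the sketch below. It is not part of readr; the function names `has_non_ascii` and `columns_not_utf8` are hypothetical, and the check relies only on base R's `Encoding()` marks:

```r
# Hypothetical check: TRUE for each string that contains bytes outside ASCII.
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) !is.na(s) && any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# Hypothetical guard: names of character columns that contain non-ASCII
# strings not marked as UTF-8.
columns_not_utf8 <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  bad <- vapply(
    df[is_chr],
    function(col) any(has_non_ascii(col) & Encoding(col) != "UTF-8"),
    logical(1)
  )
  names(df)[is_chr][bad]
}

# Demo data: an ASCII column, a UTF-8 column, and a Latin-1 column.
demo <- data.frame(
  ascii_col  = c("x", "y"),
  utf8_col   = c("caf\u00e9", "z"),
  latin1_col = iconv(c("caf\u00e9", "z"), from = "UTF-8", to = "latin1"),
  stringsAsFactors = FALSE
)

flagged <- columns_not_utf8(demo)
print(flagged)
```

A script could call `stop()` when `flagged` is non-empty, so the mismatch is reported before any file is written.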


hadley commented Jul 31, 2023

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run in a local session.

hadley added the "reprex" (needs a minimal reproducible example) label Jul 31, 2023

ujtwr commented Aug 1, 2023

I tried to create a reproduction code using the reprex package, but it seems that this problem has already been solved.

dplyr: 1.1.2
tidyr: 1.3.0
readr: 2.1.4
tibble: 3.2.1

Thank you!!


hadley commented Aug 1, 2023

Great!

hadley closed this as completed Aug 1, 2023