write_csv creates unexpected CSV file if data frame has columns with mixed encoding attributes. #1412

ujtwr · 2022-07-18T05:50:22Z

If a data frame has columns with mixed encoding attributes, write_csv generates an unexpected CSV file.

Create a tibble with UTF-8 (Japanese) column names.

library(tidyverse)

d1 <- tibble(
  id = seq(1, 1000)
) %>% 
  mutate(
    性別 = sample(x = c("男性","女性"), size = 1000, replace = TRUE),
    第1回 = rnorm(1000),
    第2回 = rnorm(1000),
    第3回 = rnorm(1000),
    第4回 = rnorm(1000),
    第5回 = rnorm(1000),
    第6回 = rnorm(1000),
    第7回 = rnorm(1000),
    )

Check the encoding attribute of the column names and of the column contents.

print(stringi::stri_enc_mark(colnames(d1)))
#> "ASCII"  "native" "native" "native" "native" "native" "native" "native" "native"

Encoding(colnames(d1))
#> [1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

## check encoding attribute of column contents that include Japanese characters
Encoding(d1$性別)
#>  [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" ...

Convert the Japanese column names, which have the "native" encoding attribute, into rows with pivot_longer.

d2 <- d1 %>% 
  pivot_longer(cols=!c(id, 性別), names_to = "実施回", values_to = "スコア")

Check the encoding attribute of the column names and of the column contents again.

print(stringi::stri_enc_mark(colnames(d2)))
#> [1] "ASCII"  "native" "UTF-8"  "UTF-8" 

# second column created by mutate step
print(stringi::stri_enc_mark(d2$性別))
#>  [1] "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8" "UTF-8"...

# third column converted by pivot_longer from column name
print(stringi::stri_enc_mark(d2$実施回))
#>  [1] "native" "native" "native" "native" "native" "native" "native" "native"...

The contents of the second column ("性別") have the "UTF-8" encoding attribute because they were created in the mutate step. On the other hand, the contents of the third column ("実施回") have the "native" encoding attribute because they were converted from column names that had the "native" encoding attribute.

So the data frame "d2" has one column with the "UTF-8" encoding attribute and one column with the "native" encoding attribute. When I write "d2" to a CSV file with "write_csv", unexpected CSV files are generated.
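A mixed-encoding data frame like this can be spotted before writing. The following is a minimal sketch in base R, not something from readr: `encoding_marks` is a hypothetical helper name, and the small data frame is made up for illustration. Note that base `Encoding()` reports "unknown" for both ASCII and native strings, so it is coarser than `stringi::stri_enc_mark()` used above.

```r
# Hypothetical helper (not part of readr/dplyr/tidyr): list the distinct
# encoding marks found in each character column of a data frame.
encoding_marks <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  lapply(df[is_chr], function(col) unique(Encoding(col)))
}

# Illustrative data: "\u" escapes produce strings explicitly marked as UTF-8.
d <- data.frame(
  id = 1:2,
  a  = c("x", "y"),                       # plain ASCII, reported as "unknown"
  b  = c("\u7537\u6027", "\u5973\u6027"), # marked "UTF-8"
  stringsAsFactors = FALSE
)

marks <- encoding_marks(d)
print(marks)
```

If different columns report different marks, the data frame is in the mixed state described here.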

Write "d2" to CSV repeatedly.

write_csv(x = d2, file = "file01.csv")
write_csv(x = d2, file = "file02.csv")
write_csv(x = d2, file = "file03.csv")
write_csv(x = d2, file = "file04.csv")
write_csv(x = d2, file = "file05.csv")

I expect all files to be identical, but they differ from each other.

In a terminal, diff the files:

$ diff file01.csv file02.csv 
#> 16c16
#> < 3,女性,第7回,-0.6832698890819322
#> ---
#> > 3,女性,第1回,-0.6832698890819322
#> 147c147
#> < 21,女性,第6回,0.7099581154824097
#> ...

$ diff file03.csv file04.csv 
#> 28c28
#> < 4,男性,第5回,-0.19047686780971046
#> ---
#> > 4,男性,第6回,-0.19047686780971046
#> 646c646
#> < 93,女性,第1回,0.6614739457424788
#> ...
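The same comparison can also be made from within R. This is a rough sketch that uses `tools::md5sum()` to compare checksums; the data frame, the temporary file paths, and the variable names are assumptions for illustration, not the data from this report.

```r
library(readr)

# Illustrative ASCII-only data; substitute the data frame under investigation.
d <- data.frame(id = 1:3, label = c("a", "b", "c"))

# Write the same data frame five times to temporary files.
files <- file.path(tempdir(), sprintf("file%02d.csv", 1:5))
for (f in files) write_csv(d, f)

# tools::md5sum() returns one checksum per file; identical files share one value.
checksums <- tools::md5sum(files)
all_identical <- length(unique(unname(checksums))) == 1
print(all_identical)
```

A `FALSE` result for repeated writes of the same data frame would indicate the nondeterministic output described above.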

As a workaround, encode all columns as "UTF-8".

d3 <- d2 %>% 
  mutate(across(everything(), ~stringi::stri_encode(.x, to = "UTF-8"))) 

write_csv(x = d3, file = "file01.csv")
write_csv(x = d3, file = "file02.csv")
write_csv(x = d3, file = "file03.csv")
write_csv(x = d3, file = "file04.csv")
write_csv(x = d3, file = "file05.csv")

All files are identical.
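One caveat with the workaround: it applies `stri_encode` to every column, which as far as I can tell coerces numeric columns to character as well. A narrower variant, sketched below under the assumption that base R's `enc2utf8()` is an acceptable substitute for `stri_encode`, touches only character columns and the column names. `to_utf8` is a hypothetical helper name and the tibble is made-up illustration data.

```r
library(dplyr)

# Hypothetical helper (not part of readr or dplyr): re-encode character
# columns and column names to UTF-8, leaving other column types untouched.
to_utf8 <- function(df) {
  df <- mutate(df, across(where(is.character), enc2utf8))
  names(df) <- enc2utf8(names(df))
  df
}

# Illustrative data with one numeric id, one character column, one double.
d <- tibble(
  id    = 1:3,
  label = c("男性", "女性", "男性"),
  score = c(0.1, 0.2, 0.3)
)

d_utf8 <- to_utf8(d)
str(d_utf8)
```

Whether this avoids the problem on the affected package versions is untested here; it only shows the type-preserving form of the conversion.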

The problem is that, at first glance, the files appear to be generated correctly: "write_csv" shows no error or warning messages.
Is it possible for write_csv to write the file correctly without converting to UTF-8 by hand, or at least to detect the error?

Or is this a problem in dplyr (mutate) or tidyr (pivot_longer)?

hadley · 2023-07-31T22:37:34Z

Could you please rework your reproducible example to use the reprex package? That makes it easier to see both the input and the output, formatted in such a way that I can easily re-run in a local session.

ujtwr · 2023-08-01T01:57:44Z

I tried to create a reproducible example using the reprex package, but it seems that this problem has already been solved.

dplyr: 1.1.2
tidyr: 1.3.0
readr: 2.1.4
tibble: 3.2.1

Thank you!!

hadley · 2023-08-01T12:13:20Z

Great!

kevinushey mentioned this issue Aug 17, 2022

UTF-8 problem with rds file with R 4.2.1 rstudio/rstudio#11774

Closed

sbearrows added the encoding label Aug 25, 2022

hadley added the reprex needs a minimal reproducible example label Jul 31, 2023

hadley closed this as completed Aug 1, 2023
