Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use with-utf8 and don't use Data.Text.IO. #1834

Merged
merged 1 commit into from
May 22, 2022

Conversation

Xitian9
Copy link
Collaborator

@Xitian9 Xitian9 commented Mar 10, 2022

fix: utf-8: Use with-utf8 to ensure all files are read and written with utf8 encoding. (Fixes #1619)

May also fix #1154, #1033, #708, #536, and #73. Testing is needed.

@Xitian9 Xitian9 changed the title fix: utf-8: Use with-utf8 to ensure all files are read and written with utf8 encoding. (#1619) Use with-utf8 and don't use Data.Text.IO. Mar 10, 2022
@Xitian9 Xitian9 force-pushed the utf8 branch 6 times, most recently from 31d6677 to 16058e5 Compare March 12, 2022 10:44
when (fmap show menc == Just "UTF-8") $ -- XXX no Eq instance, rely on Show
hSetEncoding h utf8_bom
T.hGetContents h
readHandlePortably h = hSetNewlineMode h universalNewlineMode *> TIO.hGetContents h
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, setting utf8_bom encoding made reading utf8 files more robust, by also accepting the utf8 bom prefix that some of them have. Why isn't this needed any more ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning was that I read something which recommended against it, which I thought was in the with-utf8 documentation, but I cannot find it anymore. I can leave it in if you'd prefer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that with-utf8 already sets the encoding, and so if we do it ourselves it will just be overridden. If setting the BOM is indeed desirable (which I have little knowledge of), it should probably be done upstream.

hledger-lib/Hledger/Utils.hs Outdated Show resolved Hide resolved
when (fmap show menc == Just "UTF-8") $ -- XXX no Eq instance, rely on Show
hSetEncoding h utf8_bom
T.hGetContents h
readHandlePortably h = hSetNewlineMode h universalNewlineMode *> TIO.hGetContents h
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this read with the system locale's encoding as before ? Shouldn't we be enforcing utf8 now ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it does. openFile comes from with-utf8 (I'll qualify the import so that's more obvious), and stdin should be taken care of by the withUtf8 wrapper around main.

@simonmichael
Copy link
Owner

simonmichael commented May 9, 2022

PS, I'm a little confused, github says this (8f998f7) has no conflicts with master (de70e0d) but doing the merge locally fails for me with either a bunch of conflicts or "refusing to merge unrelated branch". Is it PEBKAC, or is there something odd about this branch ?

@simonmichael
Copy link
Owner

simonmichael commented May 9, 2022

I remember we discussed this but I forgot where. This seems to have potential to be unexpectedly breaking and in that case quite annoying for users, so we should give it a detailed commit message with rationale (=> changelog/migration note), and ideally a preview release and a loud call for testing. I have missed apr 1 and may 1 but might still rush out a preview release when this lands.

@Xitian9
Copy link
Collaborator Author

Xitian9 commented May 10, 2022

I've rebased onto master. I updated the commit message with a bit more detail, but my brain is too tired to have much of a go of it at the moment. Let me know if there are specific points you'd like addressed in the message.

…ith utf8 encoding. (simonmichael#1619)

May also fix simonmichael#1154, simonmichael#1033, simonmichael#708, simonmichael#536, simonmichael#73: testing is needed.

This aims to solve all problems where misconfigured locales lead to
parsers failing on utf8-encoded data. This should hopefully avoid
encoding issues, but since it fundamentally alters how encoding is dealt
with it may lead to unexpected outcomes. Widespread testing on a number
of different platforms would be useful.
@simonmichael
Copy link
Owner

For the record, we asked about adding https://en.wikipedia.org/wiki/Byte_order_mark support at serokell/haskell-with-utf8#14 and meanwhile @Xitian9 has kept our existing BOM support in place. It seems to still be reasonably common to encounter it in the wild.

I have proposed we add a backward compatibility flag to restore old file reading behaviour anticipating a long tail of bug reports from users somehow relying on the old behaviour as they upgrade over the coming years. Let's revisit this after some testing.

@simonmichael simonmichael merged commit 65e913b into simonmichael:master May 22, 2022
@simonmichael
Copy link
Owner

simonmichael commented May 22, 2022

Merged, release docs to follow. We may or may not add a flag to restore old reading behaviour later if it seems worthwhile.

Otherwise, now that this has landed in master, any and all testing by folks with real-world non-ascii data is most welcome and helpful.

[https://groups.google.com/g/hledger/c/0gFC7xtsteI/m/lX3Fn2veBQAJ]

@simonmichael
Copy link
Owner

We decided to revert this for 1.26 (#1864) and contemplate it a bit more, following discussion eg on the mail list:
https://groups.google.com/g/hledger/c/0gFC7xtsteI/m/lX3Fn2veBQAJ
Thanks @Xitian9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Windows-built hledger executable fails in WSL Data.Text.IO should not be used
2 participants