-
-
Notifications
You must be signed in to change notification settings - Fork 320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use with-utf8 and don't use Data.Text.IO. #1834
Conversation
31d6677
to
16058e5
Compare
hledger-lib/Hledger/Utils.hs
Outdated
when (fmap show menc == Just "UTF-8") $ -- XXX no Eq instance, rely on Show | ||
hSetEncoding h utf8_bom | ||
T.hGetContents h | ||
readHandlePortably h = hSetNewlineMode h universalNewlineMode *> TIO.hGetContents h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, setting utf8_bom encoding made reading utf8 files more robust, by also accepting the utf8 bom prefix that some of them have. Why isn't this needed any more ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reasoning was that I read something which recommended against it, which I thought was in the with-utf8
documentation, but I cannot find it anymore. I can leave it in if you'd prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is that with-utf8
already sets the encoding, and so if we do it ourselves it will just be overridden. If setting the BOM is indeed desirable (which I have little knowledge of), it should probably be done upstream.
hledger-lib/Hledger/Utils.hs
Outdated
when (fmap show menc == Just "UTF-8") $ -- XXX no Eq instance, rely on Show | ||
hSetEncoding h utf8_bom | ||
T.hGetContents h | ||
readHandlePortably h = hSetNewlineMode h universalNewlineMode *> TIO.hGetContents h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this read with the system locale's encoding as before ? Shouldn't we be enforcing utf8 now ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it does. openFile
comes from with-utf8
(I'll qualify the import so that's more obvious), and stdin
should be taken care of by the withUtf8
wrapper around main
.
I remember we discussed this but I forgot where. This seems to have potential to be unexpectedly breaking and in that case quite annoying for users, so we should give it a detailed commit message with rationale (=> changelog/migration note), and ideally a preview release and a loud call for testing. I have missed apr 1 and may 1 but might still rush out a preview release when this lands. |
I've rebased onto master. I updated the commit message with a bit more detail, but my brain is too tired to have much of a go of it at the moment. Let me know if there are specific points you'd like addressed in the message. |
…ith utf8 encoding. (simonmichael#1619) May also fix simonmichael#1154, simonmichael#1033, simonmichael#708, simonmichael#536, simonmichael#73: testing is needed. This aims to solve all problems where misconfigured locales lead to parsers failing on utf8-encoded data. This should hopefully avoid encoding issues, but since it fundamentally alters how encoding is dealt with it may lead to unexpected outcomes. Widespread testing on a number of different platforms would be useful.
For the record, we asked about adding https://en.wikipedia.org/wiki/Byte_order_mark support at serokell/haskell-with-utf8#14 and meanwhile @Xitian9 has kept our existing BOM support in place. It seems to still be reasonably common to encounter it in the wild. I have proposed we add a backward compatibility flag to restore old file reading behaviour anticipating a long tail of bug reports from users somehow relying on the old behaviour as they upgrade over the coming years. Let's revisit this after some testing. |
Merged, release docs to follow. We may or may not add a flag to restore old reading behaviour later if it seems worthwhile. Otherwise, now that this has landed in master, any and all testing by folks with real-world non-ascii data is most welcome and helpful. [https://groups.google.com/g/hledger/c/0gFC7xtsteI/m/lX3Fn2veBQAJ] |
We decided to revert this for 1.26 (#1864) and contemplate it a bit more, following discussion eg on the mail list: |
fix: utf-8: Use with-utf8 to ensure all files are read and written with utf8 encoding. (Fixes #1619)
May also fix #1154, #1033, #708, #536, and #73. Testing is needed.