Problems with encoding #2
Comments
Hi there - can you provide a sample HTML document? My (pretty rudimentary) testing seems to work fine with the accented character.
Hi,
There is probably a more elegant solution, but this works for me.
Here's a unit test to demonstrate what I assume is the same problem:
This outputs:
In contrast, from the command line, tidy -utf8 test.htm does what is expected: the two characters emerge in the same form in which they went in.
After more experimenting, and trying to figure out why Radek's code works, I found that libtidy is apparently ignoring the CharacterEncoding property (at least with respect to the issue at hand). The documentation says:
Yet it seems to have no effect, at least on this issue of characters being converted to numeric character references when they shouldn't be. I have tested this with streams and files. I have not managed to get a .NET string through the Document.FromString() method without this unwanted conversion. So for files and streams, the solution is to not use CharacterEncoding, and instead explicitly set InputCharacterEncoding and OutputCharacterEncoding to EncodingType.Utf8. Short of fixing libtidy itself, it seems we could change the TidyManaged wrapper to either drop the unnecessary and broken CharacterEncoding property, or have it explicitly set the other two.
Looking through the code, I wonder if we're asking for trouble with statements like this:
What if the client instead used the InputCharacterEncoding and OutputCharacterEncoding properties? What would the value of CharacterEncoding be at this point? In the end, I think this CharacterEncoding property muddies the semantics and leads to errors. I understand that it comes from the C DLL, but this wrapper might still be better off dropping support for it.
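To make the "have it explicitly set the other two" option concrete, here is a minimal sketch of what such a property could look like. It is not TidyManaged's actual code: DocumentSketch and the enum below are hypothetical stand-ins, and only the property names and EncodingType.Utf8 come from this thread.

```csharp
// Hypothetical stand-in for the wrapper's encoding enum. Only Utf8 is taken
// from this thread; the other members are illustrative.
public enum EncodingType { Ascii, Latin1, Utf8 }

// Hypothetical sketch of a document class, NOT TidyManaged's Document.
public class DocumentSketch
{
    public EncodingType InputCharacterEncoding { get; set; }
    public EncodingType OutputCharacterEncoding { get; set; }

    // Write-only convenience property: it forwards the value to the two
    // explicit properties and stores nothing itself, so it can never
    // disagree with them.
    public EncodingType CharacterEncoding
    {
        set
        {
            InputCharacterEncoding = value;
            OutputCharacterEncoding = value;
        }
    }
}
```

Making the combined property write-only sidesteps the question of what it should return when the input and output encodings differ; whether that trade-off suits the real wrapper is a judgment call for the maintainers.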
This also works fine for me:
MemoryStream str = new MemoryStream(Encoding.UTF8.GetBytes(input));
using (TidyManaged.Document doc = TidyManaged.Document.FromStream(str))
{
    // Set both encodings explicitly instead of relying on CharacterEncoding.
    doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.CleanAndRepair();
    output = doc.Save();
}
str.Close();
IMHO the problem is in
BTW: I think it would be nice to have UTF-8 encoding, the standard in the .NET world, as the default. Is anybody able to fix these issues? They're really annoying.
So, how can I use the FromStream method and still get the right output?
@smirkchung - hrnr's example works for me - input and output are strings
Thanks for your help! It works for me.
Hi, I'm having problems with using the wrapper and UTF-8 strings. For example "é" would be replaced by �
I was just wondering if I'm doing anything fundamentally wrong? I've tried setting Input and Output character encoding values as well. Any help would be appreciated. Example code below.
Many thanks
using (TidyManaged.Document doc = TidyManaged.Document.FromString(myInput))
{
    doc.OutputXhtml = true;
    doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.CleanAndRepair();
    myOutput = doc.Save();
}