Problems with encoding #2

stebrennancode · 2011-03-21T14:28:24Z

Hi, I'm having problems with using the wrapper and UTF-8 strings. For example "é" would be replaced by �

I was just wondering if I'm doing anything fundamentally wrong? I've tried setting Input and Output character encoding values as well. Any help would be appreciated. Example code below.

Many thanks

using (TidyManaged.Document doc = TidyManaged.Document.FromString(myInput))
{
    doc.OutputXhtml = true;
    doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.CleanAndRepair();
    myOutput = doc.Save();
}

markbeaton · 2011-06-10T01:19:57Z

Hi there - can you provide a sample HTML document? My (pretty rudimentary) testing seems to work fine with the accented character.

RadekMlada · 2011-08-12T21:17:05Z

Hi,
I had a similar problem. The characters ěščřžýáíéúů were replaced by � after running the text through the parser. The text came from a database, where it was stored with windows-1250 encoding. What I ended up with (after half a day of � spam) was this solution.

        //convert str to bytes using its original encoding, convert those bytes to the
        //encoding we want to use for parsing, and wrap them in a stream so that .NET
        //does not meddle with them
        Encoding srcEncoding = Encoding.GetEncoding("windows-1250");
        byte[] srcEncodingBytes = srcEncoding.GetBytes(str);
        Encoding destEncoding = Encoding.UTF8;
        byte[] destEncodingBytes = Encoding.Convert(srcEncoding, destEncoding, srcEncodingBytes);
        var strStream = new MemoryStream(destEncodingBytes);

        //do the parsing
        var doc = TidyManaged.Document.FromStream(strStream);
        doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
        doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
        doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
        doc.ShowWarnings = false;
        doc.Quiet = true;
        doc.OutputXhtml = true;
        doc.CleanAndRepair();
        str = doc.Save();

There probably is a more elegant solution, but this works for me.
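
For readers wondering where the � comes from in the first place: it is U+FFFD, the replacement character a decoder emits when it meets bytes that are not valid in the encoding it was told to use. The sketch below reproduces that with plain .NET and no Tidy involved. It uses ISO-8859-1 instead of windows-1250 only because that encoding is available on every .NET runtime, and the class and method names are made up for the illustration. It shows the general mechanism, and is not a claim about what TidyManaged does internally.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only; not part of TidyManaged.
public static class ReplacementCharDemo
{
    // Encode text with one encoding, then decode the same bytes with another.
    public static string Reinterpret(string text, Encoding writtenAs, Encoding readAs)
    {
        byte[] bytes = writtenAs.GetBytes(text);
        return readAs.GetString(bytes);
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        // "é" (U+00E9) is the single byte 0xE9 in ISO-8859-1. That byte alone is
        // not a valid UTF-8 sequence, so the UTF-8 decoder substitutes U+FFFD.
        string mangled = Reinterpret("\u00E9", latin1, Encoding.UTF8);
        Console.WriteLine(mangled.IndexOf('\uFFFD') >= 0);   // True

        // When writer and reader agree on the encoding, the text survives.
        string intact = Reinterpret("\u00E9", Encoding.UTF8, Encoding.UTF8);
        Console.WriteLine(intact == "\u00E9");               // True
    }
}
```

If the same mismatch happens anywhere between the string and the parser, the accented characters are already lost before Tidy sees them, which is why the byte-level conversion above is done explicitly.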

hatton · 2011-12-03T18:00:23Z

Here's a unit test to demonstrate what I assume is the same problem:

[Test]
public void RoundTripsUtf8File()
{
    // ŋ (velar nasal)--> &#331;
    // β (greek beta) (03B2) --> &#946;
    using (var input = TempFile.CreateAndGetPathButDontMakeTheFile())
    {
        var source = "<!DOCTYPE html><html><head> <meta charset='UTF-8'></head><body>ŋ β</body></html>";
        File.WriteAllText(input.Path, source, Encoding.UTF8);
        using (var tidy = TidyManaged.Document.FromFile(input.Path))
        {
            tidy.CharacterEncoding = EncodingType.Utf8; //tried Raw, too
            tidy.CleanAndRepair();
            using (var output = new TempFile())
            {
                tidy.Save(output.Path);
                var newContents = File.ReadAllText(output.Path);
                Assert.IsTrue(newContents.Contains("ŋ"), newContents);
            }
        }
    }
}

This outputs:

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 October 2008), see www.w3.org">
<meta charset='UTF-8'>
<title></title>
</head>
<body>
&#331; &#946;
</body>
</html>

Expected: True
But was:  False

In contrast, from the command line,

tidy -utf8 test.htm

does the expected thing: the two characters emerge in the same form in which they went in.

hatton · 2011-12-04T14:27:04Z

After more experimenting, and trying to figure out why Radek's code works, I found that libtidy is apparently ignoring the CharacterEncoding property (at least with respect to the issue at hand). The documentation says:

This option specifies the character encoding Tidy uses for both the input and output.

Yet it seems to have no effect, at least on this issue of characters being converted to numeric character references when they shouldn't be. I have tested this with streams and files. I have not managed to get a .NET string to pass through the Document.FromString() method without this unwanted conversion.

So for files and streams, the solution is to not use CharacterEncoding, and explicitly set the InputCharacterEncoding and OutputCharacterEncoding to EncodingType.Utf8.

Short of fixing LibTidy itself, it seems we could change the TidyManaged wrapper to either drop the unnecessary and broken CharacterEncoding property, or have it explicitly set the other two.
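
The second option (have the combined property set the other two) could look roughly like the sketch below. Everything in it is a hypothetical stand-in written for this illustration: `DocumentSketch` is not the real `TidyManaged.Document`, the enum lists only the two members mentioned in this thread, and the real wrapper would forward the values to libtidy instead of storing them in fields. The property is write-only here as one possible design choice, since a getter has no single correct answer once input and output differ.

```csharp
using System;

// Hypothetical stand-ins for illustration; not TidyManaged's actual source.
public enum EncodingType { Raw, Utf8 }

public class DocumentSketch
{
    public EncodingType InputCharacterEncoding { get; set; }
    public EncodingType OutputCharacterEncoding { get; set; }

    // Write-only convenience property: setting it fans out to both
    // directions, so it can never silently disagree with them.
    public EncodingType CharacterEncoding
    {
        set
        {
            InputCharacterEncoding = value;
            OutputCharacterEncoding = value;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var doc = new DocumentSketch { CharacterEncoding = EncodingType.Utf8 };
        Console.WriteLine(doc.InputCharacterEncoding);   // Utf8
        Console.WriteLine(doc.OutputCharacterEncoding);  // Utf8
    }
}
```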

hatton · 2011-12-05T14:11:50Z

Looking through the code, I wonder if we're asking for trouble with statements like this:

var tempEnc = this.CharacterEncoding;

What if the client instead used the InputCharacterEncoding and OutputCharacterEncoding properties? What would the value of CharacterEncoding be at this point?

In the end, I think this CharacterEncoding property muddies the semantics and leads to errors. I understand that it comes from the C DLL, but this wrapper might still be better off dropping support for it.

hrnr · 2012-01-10T16:27:07Z

This also works fine for me:

MemoryStream str = new MemoryStream(Encoding.UTF8.GetBytes(input));
using (TidyManaged.Document doc = TidyManaged.Document.FromStream(str))
{
    doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.CleanAndRepair();
    output = doc.Save();
}
str.Close();

IMHO the problem is in the Document.FromString() method.
I agree with hatton that Document.CharacterEncoding does nothing at all.

BTW: I think it would be nice to have UTF-8 encoding, the standard in the .NET world, as the default.

Is anybody able to fix those issues? They're really annoying.

smirkchung · 2015-02-09T17:02:41Z

So how can I use the FromStream method and still get the right output?

rangler2 · 2016-01-05T15:50:25Z

@smirkchung - hrnr's example works for me - input and output are strings

bao-vn · 2018-11-13T16:35:13Z

Thanks for your help! It works for me.
