Problems with encoding #2

Open
stebrennancode opened this issue Mar 21, 2011 · 9 comments
Comments

@stebrennancode

Hi, I'm having problems using the wrapper with UTF-8 strings. For example, "é" is replaced by �.

I was just wondering if I'm doing anything fundamentally wrong? I've tried setting Input and Output character encoding values as well. Any help would be appreciated. Example code below.

Many thanks

using (TidyManaged.Document doc = TidyManaged.Document.FromString(myInput))
{
    doc.OutputXhtml = true;
    doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.CleanAndRepair();
    myOutput = doc.Save();
}
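
For readers wondering where � typically comes from: it is U+FFFD, the replacement character a decoder emits when it meets bytes that are invalid in the encoding it assumes. The sketch below reproduces that effect with plain .NET encodings only. It does not use TidyManaged and does not claim to be the cause of this issue; the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

// Illustration only: names here are hypothetical and unrelated to TidyManaged.
public static class ReplacementCharDemo
{
    // Encodes text with one encoding and decodes the bytes with another,
    // mimicking a producer and a consumer that disagree about the encoding.
    public static string Transcode(string text, Encoding writtenAs, Encoding readAs)
    {
        byte[] bytes = writtenAs.GetBytes(text);
        return readAs.GetString(bytes);
    }

    // "é" (U+00E9) in ISO-8859-1 is the single byte 0xE9. That byte is not a
    // valid UTF-8 sequence on its own, so the UTF-8 decoder substitutes U+FFFD.
    public static string LatinBytesReadAsUtf8()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return Transcode("\u00E9", latin1, Encoding.UTF8);
    }
}
```

Calling `ReplacementCharDemo.LatinBytesReadAsUtf8()` yields a one-character string containing U+FFFD, whereas `Transcode` with UTF-8 on both sides returns the input unchanged.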

@markbeaton
Owner

Hi there - can you provide a sample HTML document? My (pretty rudimentary) testing seems to work fine with the accented character.

@RadekMlada

Hi,
I had a similar problem. The characters ěščřžýáíéúů were replaced by � after running the text through the parser. The text came from a database, where it was stored with windows-1250 encoding. What I ended up with (after half a day of � spam) was this solution.

        // Encode str to bytes using its original encoding, convert those bytes to the
        // encoding we want to parse with, and wrap them in a stream so that .NET does
        // not meddle with them.
        Encoding srcEncoding = Encoding.GetEncoding("windows-1250");
        byte[] srcEncodingBytes = srcEncoding.GetBytes(str);
        Encoding destEncoding = Encoding.UTF8;
        byte[] destEncodingBytes = Encoding.Convert(srcEncoding, destEncoding, srcEncodingBytes);
        var strStream = new MemoryStream(destEncodingBytes);

        //do the parsing
        var doc = TidyManaged.Document.FromStream(strStream);
        doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
        doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
        doc.CharacterEncoding = TidyManaged.EncodingType.Utf8;
        doc.ShowWarnings = false;
        doc.Quiet = true;
        doc.OutputXhtml = true;
        doc.CleanAndRepair();
        str = doc.Save(); 

There probably is a more elegant solution, but this works for me.
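
A possible simplification, offered as a sketch rather than a verified fix: once the text is already a correctly decoded .NET string, going through the source encoding first should give the same UTF-8 bytes as calling `Encoding.UTF8.GetBytes` directly, as long as every character exists in the source encoding. The example below checks that with plain .NET APIs. It uses ISO-8859-1 as a stand-in for windows-1250 (an assumption made so the sample does not depend on the code-pages provider), and the class and method names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text;

// Illustration only: ISO-8859-1 stands in for windows-1250 here.
public static class Utf8BytesDemo
{
    // Route from the comment above: string -> source-encoding bytes -> UTF-8 bytes.
    public static byte[] ViaSourceEncoding(string text, Encoding source)
    {
        byte[] sourceBytes = source.GetBytes(text);
        return Encoding.Convert(source, Encoding.UTF8, sourceBytes);
    }

    // Direct route: string -> UTF-8 bytes.
    public static byte[] Direct(string text)
    {
        return Encoding.UTF8.GetBytes(text);
    }

    // True when both routes produce identical byte sequences.
    public static bool RoutesAgree(string text, Encoding source)
    {
        return ViaSourceEncoding(text, source).SequenceEqual(Direct(text));
    }
}
```

With `Encoding.GetEncoding("iso-8859-1")` as the source, `RoutesAgree("caf\u00E9", ...)` is true, while a character outside that encoding such as "\u011B" (ě) makes the routes disagree because the detour cannot represent it. If that holds for windows-1250 too, the detour can only lose information, never add any.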

@hatton

hatton commented Dec 3, 2011

Here's a unit test to demonstrate what I assume is the same problem:

[Test]
public void RoundTripsUtf8File()
{
    // ŋ (velar nasal) (014B) --> &#331;
    // β (greek beta) (03B2) --> &#946;
    using (var input = TempFile.CreateAndGetPathButDontMakeTheFile())
    {
        var source = "<!DOCTYPE html><html><head> <meta charset='UTF-8'></head><body>ŋ β</body></html>";
        File.WriteAllText(input.Path, source, Encoding.UTF8);
        using (var tidy = TidyManaged.Document.FromFile(input.Path))
        {
            tidy.CharacterEncoding = EncodingType.Utf8; //tried Raw, too
            tidy.CleanAndRepair();
            using (var output = new TempFile())
            {
                tidy.Save(output.Path);
                var newContents = File.ReadAllText(output.Path);
                Assert.IsTrue(newContents.Contains("ŋ"), newContents);
            }
        }
    }
}

This outputs:

<!DOCTYPE html>
<html>
<head>
<meta name="generator" content=
"HTML Tidy for Windows (vers 14 October 2008), see www.w3.org">
<meta charset='UTF-8'>
<title></title>
</head>
<body>
&#331; &#946;
</body>
</html>
Expected: True
But was:  False

In contrast, from the command line,

tidy -utf8 test.htm

does what is expected: the two characters emerge in the same form that they went in.
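
To make the relationship between the two forms concrete: `&#331;` and `&#946;` are decimal numeric character references for U+014B and U+03B2, so the tidied document still denotes the same text even though the test's literal `Contains("ŋ")` check fails. A small sketch using only .NET's `System.Net.WebUtility` (not TidyManaged) shows the mapping in both directions; the class and method names are illustrative.

```csharp
using System;
using System.Globalization;
using System.Net;

// Illustration only: helper names are hypothetical.
public static class CharacterReferenceDemo
{
    // Decodes HTML character references such as "&#331;" into the
    // characters they denote.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    // Builds the decimal numeric character reference for one UTF-16 code unit.
    public static string ToNumericReference(char c)
    {
        return "&#" + ((int)c).ToString(CultureInfo.InvariantCulture) + ";";
    }
}
```

`CharacterReferenceDemo.Decode("&#331; &#946;")` gives back "ŋ β", and `ToNumericReference('\u014B')` gives "&#331;". A test that decodes the output before comparing would therefore pass, although that would hide the fact that the requested UTF-8 output encoding is not being honored.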

@hatton

hatton commented Dec 4, 2011

After more experimenting, and trying to figure out why Radek's code works, I found that libtidy is apparently ignoring the CharacterEncoding property (at least with respect to the issue at hand). The documentation says:

This option specifies the character encoding Tidy uses for both the input and output.

Yet it seems to have no effect, at least with this issue of converting characters to their numeric character references when it shouldn't. I have tested this with streams and files. I have not managed to get a .NET string through the Document.FromString() method without this unwanted conversion.

So for files and streams, the solution is to not use CharacterEncoding, and explicitly set the InputCharacterEncoding and OutputCharacterEncoding to EncodingType.Utf8.

Short of fixing LibTidy itself, it seems we could change the TidyManaged wrapper to either drop the unnecessary and broken CharacterEncoding property, or have it explicitly set the other two.

@hatton

hatton commented Dec 5, 2011

Looking through the code, I wonder if we're asking for trouble with statements like this:

var tempEnc = this.CharacterEncoding;

What if the client instead used the InputCharacterEncoding and OutputCharacterEncoding properties? What would the value of CharacterEncoding be at this point?

In the end, I think this CharacterEncoding property muddies the semantics and leads to errors. I understand that it comes from the C DLL, but this wrapper might still be better off dropping support for it.
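
For discussion, here is one way the "have it explicitly set the other two" option could look, including an answer to the question of what the combined property should return when input and output differ. This is a sketch on made-up stand-in types, not the TidyManaged source: `SketchEncoding`, `EncodingOptionsSketch`, and the ASCII default are all assumptions.

```csharp
using System;

// Hypothetical stand-ins; these are NOT the TidyManaged types.
public enum SketchEncoding { Ascii, Latin1, Utf8 }

public sealed class EncodingOptionsSketch
{
    // The default value is an assumption made for this sketch.
    public SketchEncoding InputCharacterEncoding { get; set; } = SketchEncoding.Ascii;
    public SketchEncoding OutputCharacterEncoding { get; set; } = SketchEncoding.Ascii;

    // Setting the combined property forwards to both specific properties.
    // Reading it returns null when input and output differ, so a caller
    // never sees a value that describes only one side.
    public SketchEncoding? CharacterEncoding
    {
        get
        {
            return InputCharacterEncoding == OutputCharacterEncoding
                ? InputCharacterEncoding
                : (SketchEncoding?)null;
        }
        set
        {
            if (value == null) throw new ArgumentNullException(nameof(value));
            InputCharacterEncoding = value.Value;
            OutputCharacterEncoding = value.Value;
        }
    }
}
```

With this shape, code like `var tempEnc = this.CharacterEncoding;` has to handle the null case explicitly instead of silently picking up a stale or one-sided value. Dropping the property altogether would be the simpler alternative.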

@hrnr

hrnr commented Jan 10, 2012

This also works fine for me:

using (MemoryStream str = new MemoryStream(Encoding.UTF8.GetBytes(input)))
using (TidyManaged.Document doc = TidyManaged.Document.FromStream(str))
{
    doc.InputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.OutputCharacterEncoding = TidyManaged.EncodingType.Utf8;
    doc.CleanAndRepair();
    output = doc.Save();
}

IMHO the problem is in the Document.FromString() method.
I agree with hatton that Document.CharacterEncoding does nothing at all.

BTW: I think it would be nice to have UTF-8, the standard in the .NET world, as the default encoding.

Is anybody able to fix these issues? They're really annoying.

@smirkchung

So, how can I use the FromStream method and get the right output?

@rangler2

rangler2 commented Jan 5, 2016

@smirkchung - hrnr's example works for me - input and output are strings

@bao-vn

bao-vn commented Nov 13, 2018

Thanks for your help! It works for me.
