TidyManaged encoding can not support GBK or GB2312 #13

Open
garysun90 opened this issue May 23, 2017 · 1 comment

Comments

@garysun90

If I input Chinese characters, the output is ???.

using (Document doc = Document.FromString("已完成,请查看"))
{
    doc.ShowWarnings = false;
    doc.Quiet = true;
    doc.OutputXhtml = true;
    doc.CleanAndRepair();
    string parsed = doc.Save();
    Console.WriteLine(parsed);
}
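
One plausible source of the ??? output can be sketched without TidyManaged: when a .NET string is converted to a byte encoding that has no mapping for a character, the default encoder fallback substitutes '?'. The class and method names below are hypothetical, and whether this is exactly where TidyManaged loses the characters is an assumption rather than something confirmed in this thread.

```csharp
using System;
using System.Text;

static class EncodingLossDemo
{
    // Hypothetical helper: encodes text to bytes with the given encoding,
    // then decodes those bytes back with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }
}
```

Calling EncodingLossDemo.RoundTrip("已完成", Encoding.ASCII) yields "???", while the same call with Encoding.UTF8 returns the text unchanged.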

garysun90 commented May 23, 2017

I found a solution: modify the Save and CleanAndRepair methods in the Document class.
	/// <summary>
	/// Parses input markup, and executes configured cleanup and repair operations.
	/// </summary>
	public void CleanAndRepair()
	{
		if (fromString)
		{
			EncodingType tempEnc = this.InputCharacterEncoding;
			this.InputCharacterEncoding = EncodingType.Big5;
			PInvoke.tidyParseString(this.handle, this.htmlString);
			this.InputCharacterEncoding = tempEnc;
		}
		else
		{
			InputSource input = new InputSource(this.stream);
			PInvoke.tidyParseSource(this.handle, ref input.TidyInputSource);
		}
		PInvoke.tidyCleanAndRepair(this.handle);
		cleaned = true;
	}

	/// <summary>
	/// Saves the processed markup to a string.
	/// </summary>
	/// <returns>A string containing the processed markup.</returns>
	public string Save()
	{
		if (!cleaned)
			throw new InvalidOperationException("CleanAndRepair() must be called before Save().");
		var tempEnc = this.CharacterEncoding;
		var tempBOM = this.OutputByteOrderMark;
		this.OutputCharacterEncoding = EncodingType.Utf8;
		this.OutputByteOrderMark = AutoBool.No;

		uint bufferLength = 1;
		byte[] htmlBytes;
		GCHandle handle = new GCHandle();
		do
		{
			// Buffer was too small - bufferLength should now be the required length, so try again...
			if (handle.IsAllocated) handle.Free();

			// this setting appears to be reset by libtidy after calling tidySaveString; we need to set it each time
			this.OutputCharacterEncoding = EncodingType.Big5;

			htmlBytes = new byte[bufferLength];
			handle = GCHandle.Alloc(htmlBytes, GCHandleType.Pinned);
		} while (PInvoke.tidySaveString(this.handle, handle.AddrOfPinnedObject(), ref bufferLength) == -12);

		handle.Free();

		this.OutputCharacterEncoding = tempEnc;
		this.OutputByteOrderMark = tempBOM;
		return Encoding.GetEncoding("GB2312").GetString(htmlBytes);
	}
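
The final decoding step in Save() can be tried in isolation. The sketch below is not part of TidyManaged and its class and method names are hypothetical. It assumes .NET Core or .NET 5+, where code-page encodings such as GB2312 must be registered before Encoding.GetEncoding can find them. On the .NET Framework the registration line should not be needed, and CodePagesEncodingProvider may require the System.Text.Encoding.CodePages package.

```csharp
using System;
using System.Text;

static class Gb2312Demo
{
    // Hypothetical helper: encodes text as GB2312 bytes and decodes it back,
    // using the same Encoding.GetEncoding("GB2312") lookup as the patched Save().
    public static string RoundTrip(string text)
    {
        // Assumption: running on .NET Core / .NET 5+, where code-page
        // encodings are not available until this provider is registered.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding gb2312 = Encoding.GetEncoding("GB2312");
        byte[] bytes = gb2312.GetBytes(text);
        return gb2312.GetString(bytes);
    }
}
```

If Gb2312Demo.RoundTrip("已完成,请查看") returns the input unchanged, the GB2312 encoding is available in the current runtime and the decode in Save() can work as intended.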
