I can't believe this hasn't been covered before, but I've as yet been
unable to find a solution to the following:
puts HTML5::HTMLParser.new.parse('Test dátá')
provides:
<html><head/><body>Test dรกtรก</body></html>
As can be seen, the text in the body has the wrong characters where á
should be, so I suspected a normal UTF8 conversion bug.
However, just to really mess with my mind, I thought the following would be
a more complete test to post here:
puts HTML5::HTMLParser.new.parse('Sámple Téxt Wíth Acceñts')
produces:
<html><head/><body>Sámple Téxt Wíth Acceñts</body></html>
which is correct!! My next step was to try removing each accent one by one,
until only the first á is present. Each attempt worked except the last,
which produced:
<html><head/><body>Sรกmple Text With Accents</body></html>
Clearly, there is something very strange here, and it's causing major pain.
Does anyone have any suggestions as to what's going on, and more importantly,
how to fix it?
Versions:
gem -v 1.0.1
html5 (0.10.0)
ruby 1.8.6
Ubuntu 7.10 systems
Many thanks, Sam
Original issue reported on code.google.com by [email protected] on 15 Feb 2008 at 11:44
Aha, I've found a fix to the problem, although I wouldn't call it a full
solution.
Forcing the encoding used will alleviate the encoding problem:
puts HTML5::HTMLParser.new.parse('Test dátá', 'utf-8')
produces:
<html><head/><body>Test dátá</body></html>
Which is correct. My guess is that there must be a problem with the encoding
auto-detect routines (assuming an attempt is made to autodetect it!)
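One way to sanity-check that guess: the garbled characters above are exactly what you get when the UTF-8 bytes of á (0xC3 0xA1) are read as a Thai single-byte encoding. The sketch below only reproduces that misreading; it says nothing about what the html5 gem actually does internally. It assumes Ruby 1.9 or later (String#force_encoding does not exist in 1.8.6), and `misdecode` is a made-up helper name, not part of any library.

```ruby
# Hypothetical helper: reinterpret the raw bytes of a UTF-8 string as if
# they were written in another encoding, then convert back to UTF-8 so
# the result can be printed.
def misdecode(text, wrong_encoding)
  text.dup.force_encoding(wrong_encoding).encode("UTF-8")
end

# "\u00E1" is á; escapes are used so the script does not depend on the
# encoding of the source file.
puts misdecode("Test d\u00E1t\u00E1", "Windows-874")
```

If the printed text matches the broken output from the first example, that would be consistent with the input being detected as Thai rather than UTF-8, which short strings with few non-ASCII bytes could plausibly trigger.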
Would love to hear of a more complete solution.
Cheers, sam