Open
Description
Hi @hgmnz,
A client is running some content with Unicode characters (namely, an up arrow) through truncate_html and noticing that those characters are disappearing.
I've narrowed it down to the scan in TruncateHtml::HtmlString. However, that's a hell of a regex to read, so I was wondering if you wouldn't mind walking me through it.
You can paste this code into an .rb file and run it to see what I mean:
# encoding: utf-8
unicode_string = "Up Arrow (↑) points up."
# From TruncateHtml::HtmlString
#
def regex
  /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/
end
# scan normally respects unicode.
puts unicode_string.scan(/.*/).join
# but this regex does not.
puts unicode_string.scan(regex).join
The result at the command line is:
Up Arrow (↑) points up.
Up Arrow () points up.
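A likely explanation, as far as I can tell: scan only returns text that some alternative of the regex matches, and none of them covers "↑". It is not a letter, `\w` is ASCII-only in Ruby, it is not whitespace, and it belongs to the Unicode Symbol category rather than Punctuation, so `[[:punct:]]` skips it. The sketch below checks which characters get dropped and tries one possible adjustment. `dropped_chars` and `PATCHED` are hypothetical names for illustration, and the extra `\p{S}` branch is an assumption, not the gem's actual fix.

```ruby
# encoding: utf-8

# Regex copied verbatim from the report above (TruncateHtml::HtmlString).
ORIGINAL = /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/

# Hypothetical variant: the same alternatives plus a branch for
# Unicode symbols (\p{S}), which is the category "↑" falls into.
PATCHED = Regexp.union(ORIGINAL, /\p{S}/)

# Returns the characters of `string` that do not survive a scan/join
# round trip with `regex`, i.e. characters no alternative matched.
def dropped_chars(string, regex)
  string.chars - string.scan(regex).join.chars
end

sample = "Up Arrow (↑) points up."

p dropped_chars(sample, ORIGINAL) # => ["↑"]
p dropped_chars(sample, PATCHED)  # => []
```

This only covers symbols. Other characters that match none of the alternatives would presumably still be dropped, so a catch-all branch might be the safer route.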
Thanks!
rob-mcgrail