truncate_html does not respect Unicode

Hi @hgmnz,

A client is running some content with Unicode characters (namely, an up arrow) through `truncate_html` and noticing that those characters are disappearing.

I've narrowed it down to the `scan` in `TruncateHtml::HtmlString`. However, that's a hell of a regex to read, so I was wondering if you wouldn't mind walking me through it.

You can paste this code into an `.rb` file and run it to see what I mean:

```
# encoding: utf-8
unicode_string = "Up Arrow (↑) points up."

# Copied from TruncateHtml::HtmlString
def regex
  /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/
end

# scan with a catch-all pattern keeps the Unicode character.
puts unicode_string.scan(/.*/).join

# Scanning with the regex above drops it.
puts unicode_string.scan(regex).join
```

The result at the command line is:

```
Up Arrow (↑) points up.
Up Arrow () points up.
```
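
For what it's worth, here is a minimal sketch of how one might check which characters fall through the regex. `String#scan` only returns the substrings that match, so any character that none of the alternatives accepts is silently skipped. My guess (an assumption, not something confirmed against the gem's internals) is that the up arrow is a Unicode symbol (`\p{S}`) rather than a letter, word character, whitespace, or `[[:punct:]]` character, so nothing in the pattern picks it up. The names `TOKEN_REGEX` and `dropped_characters` are hypothetical and exist only for this illustration.

```ruby
# encoding: utf-8

# Regex copied verbatim from the snippet above (TruncateHtml::HtmlString).
TOKEN_REGEX = /(?:<script.*>.*<\/script>)+|<\/?[^>]+>|[[[:alpha:]]\w\|`~!@#\$%^&*\-_\+=\[\]{}:;'",\.\/?]+|\s+|[[:punct:]]/

# Hypothetical helper: returns the distinct characters of +input+ that
# no alternative of +regex+ matches when each character is tested alone.
def dropped_characters(input, regex)
  input.each_char.reject { |char| char =~ regex }.uniq
end

unicode_string = "Up Arrow (↑) points up."

# Characters the tokenizer regex would skip during scan.
p dropped_characters(unicode_string, TOKEN_REGEX)

# Compare how the arrow fares against the punctuation and symbol classes.
p "↑" =~ /[[:punct:]]/
p "↑" =~ /\p{S}/
```

If that guess holds, adding a symbol class such as `\p{S}` as another alternative might be enough to keep these characters, though I haven't checked how that interacts with the truncation logic.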

Thanks!

