Releases: mudge/re2
0.7.0: MatchData begin & end
Thanks to an issue raised by @driskell about functionality missing from RE2's MatchData (compared to MRI's) in #20, I'm happy to announce version 0.7.0 of re2, now including RE2::MatchData#begin and RE2::MatchData#end for finding the offset of matches in your searches.
The API is the same as the standard library's begin and end:
m = RE2('w(o+)').match('he said woohoo!')
m.begin(0)
# => 8
m.end(0)
# => 11
It also works with RE2's named captures:
m = RE2('w(?P<cheers>o+)').match('he said woohoo!')
m.begin('cheers')
# => 9
m.end(:cheers)
# => 11
Note that on versions of Ruby prior to 1.9, the offset will be in bytes, while later Ruby versions will return the offset in characters. This is consistent with other string functions (such as length and slicing with []) and so gives the least surprising behaviour when dealing with multibyte characters. It is also why the specs for this behaviour cannot rely on the exact return value of begin and end.
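To make the character-versus-byte distinction concrete, here is a minimal sketch. It uses Ruby's built-in Regexp rather than re2 (on the basis that the API is described above as the same), and the sample string is made up for illustration. On Ruby 1.9 and later, begin and end count characters, so they line up with slicing via [] even when a multibyte character precedes the match.

```ruby
# Illustration with Ruby's built-in Regexp (1.9+), not re2 itself.
text = "h\u00e9llo woohoo!" # "héllo woohoo!": é is one character but two bytes in UTF-8
m = /w(o+)/.match(text)

start  = m.begin(0) # character offset of the start of the whole match
finish = m.end(0)   # character offset just past the end of the match

matched      = text[start...finish]  # slicing with [] also counts characters
bytes_before = m.pre_match.bytesize  # number of bytes that precede the match

puts "characters #{start}...#{finish} -> #{matched.inspect}"
puts "bytes before the match: #{bytes_before}"
```

The character offset and the byte count differ by one here because of the single two-byte character before the match.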
As a technical aside: the trickiest part of implementing this was efficiently calculating the length of the offset string, as different implementations of Ruby vary in their string functions. Prior to Ruby 1.9, the offset is calculated using simple pointer arithmetic, but other versions will try to use rb_str_sublen when available, falling back to rb_str_length (and incurring the cost of an extra string allocation) on implementations such as Rubinius.
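Roughly speaking, the conversion amounts to counting the characters contained in the bytes that precede the match. The following pure-Ruby sketch shows that idea only; the char_offset helper and the sample string are hypothetical, and the real work happens in C as described above.

```ruby
# Hypothetical helper for illustration; re2 does this in C, not in Ruby.
def char_offset(str, byte_offset)
  # Count the characters contained in the first byte_offset bytes.
  # This allocates a temporary substring, much like the fallback path above.
  str.byteslice(0, byte_offset).length
end

text = "na\u00efve woohoo!"          # "naïve woohoo!": ï occupies two bytes in UTF-8
byte_offset = text.b.index("woo")    # byte position of the match in the raw bytes
char_pos = char_offset(text, byte_offset)

puts "byte offset #{byte_offset} is character offset #{char_pos}"
```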
Many thanks to @driskell for originally contributing this and providing invaluable feedback during its development.
0.6.0: Introducing RE2::Scanner
Scanning
Thanks to a suggestion from Matthias Kadenbach, re2 now contains an API for incrementally scanning a string for matches. To use it, call scan on an instance of RE2::Regexp with the string you want to search:
scanner = RE2('(\d+)').scan("Some 1 long 23 string 4 containing 567 numbers")
scanner.scan #=> ["1"]
scanner.scan #=> ["23"]
The scanner in the example above is an instance of RE2::Scanner, which has one main method, scan, which returns the next match. Once no more matches are found, scan will return nil. You can use rewind to reset a scanner back to the beginning of the string.
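The scan, nil, rewind cycle can be modelled in a few lines of plain Ruby. This is a sketch of the behaviour described above, not re2's implementation: SketchScanner is a hypothetical stand-in built on Ruby's own Regexp, and its internals (a character position that advances after each match) are an assumption made for illustration.

```ruby
# Hypothetical stand-in built on Ruby's own Regexp; not re2's implementation.
class SketchScanner
  def initialize(pattern, text)
    @pattern = pattern
    @text = text
    @pos = 0
  end

  # Return the captures of the next match, or nil when none remain.
  def scan
    return nil if @pos > @text.length
    m = @pattern.match(@text, @pos)
    if m.nil?
      @pos = @text.length + 1 # remember that the text is exhausted
      return nil
    end
    # Always advance, even past an empty match, to avoid looping forever.
    @pos = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
    m.captures
  end

  # Start again from the beginning of the text.
  def rewind
    @pos = 0
    self
  end
end

scanner = SketchScanner.new(/(\d+)/, "Some 1 long 23 string 4 containing 567 numbers")
first  = scanner.scan
second = scanner.scan
rest   = [scanner.scan, scanner.scan, scanner.scan] # the last call finds nothing
scanner.rewind
again  = scanner.scan

p first, second, rest, again
```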
The RE2::Scanner class also implements Ruby's Enumerable interface, so you can call each and to_enum on it:
scanner = RE2('(\d+)').scan("Some 1 long 23 string 4 containing 567 numbers")
scanner.each do |match|
  puts match
end
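For readers unfamiliar with how a class gains these methods in Ruby, the sketch below shows the general pattern: define each and mix in Enumerable. DigitRuns is a hypothetical class written for illustration and is not part of re2; it assumes nothing about how RE2::Scanner is implemented beyond exposing each.

```ruby
# Hypothetical class for illustration; it is not part of re2.
class DigitRuns
  include Enumerable

  def initialize(text)
    @text = text
  end

  # Enumerable builds map, select, first, to_a and friends on top of each.
  def each
    return to_enum(:each) unless block_given?
    # String#scan with a capture group yields an array of captures per match.
    @text.scan(/(\d+)/) { |captures| yield captures }
    self
  end
end

runs = DigitRuns.new("Some 1 long 23 string 4 containing 567 numbers")
numbers   = runs.map { |captures| captures.first.to_i }
first_two = runs.first(2)
enum      = runs.to_enum # an external enumerator that can be stepped with next

p numbers, first_two
```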
No more in-place replacement
This release removes methods that previously altered strings in-place. This means re2_sub! and re2_gsub! are gone and RE2.Replace and RE2.GlobalReplace now return new strings rather than modifying their input.
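One practical consequence is that callers must use the return value, since the input is left untouched. The sketch below shows the same contract using Ruby's built-in non-bang String#sub and String#gsub as an analogy; it does not call re2, and the sample string is made up.

```ruby
# Analogy using Ruby's built-in String methods, not re2.
original = "woohoo".freeze # frozen, so any in-place change would raise

replaced     = original.sub(/o+/, "a")  # first match only, returned as a new string
all_replaced = original.gsub(/o+/, "a") # every match, returned as a new string

puts original     # the input is unchanged
puts replaced
puts all_replaced
```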
Encoding awareness
Again, thanks to a bug report by Matthias Kadenbach: in Ruby 1.9 and later, re2 will now set the correct encoding for strings.
m = RE2('(\w+)', :utf8 => true).match("foo")
m[1].encoding # => #<Encoding:UTF-8>