
Ruby 1.8 regexp: index of match in UTF-8 string

I'm trying to search a text for a match and return it with a snippet of the surrounding text. To do this, I want to find the match with a regex, then cut the string using the match index plus or minus a snippet radius (text.mb_chars[start..finish]).

However, I cannot get Ruby's (1.8) regex to return a match index that is multi-byte aware.

I understand that regex is the one place in 1.8 that is supposed to be UTF-aware, but it doesn't seem to work despite the /u switch:

"Résumé" =~ /s/u
=> 3

"Resume" =~ /s/u
=> 2

The result should be the same if the regex were really working in multi-byte mode (/u), but it's returning a byte index.

How do I get the match index in characters, not bytes?

Or is there some other way to get a snippet around (each) match?


Not a real answer, but too long for a comment.

The code

print "Résumé" =~ /s/u
print "\n"
print "Resume" =~ /s/u

on Windows (Ruby 1.8.6, release 26.) prints:

2
2

And on Linux (ruby 1.8.7 (2009-06-12 patchlevel 174) [i486-linux]) it prints:

3
2
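One explanation consistent with these two outputs (an assumption, not something verified on either machine) is that the script was saved in a single-byte encoding on Windows and as UTF-8 on Linux. In UTF-8, "é" occupies two bytes, so a byte-based index for "s" is 3 rather than 2. The sketch below assumes the file is saved as UTF-8 and only counts bytes and code points:

```ruby
# Assumes this source file is saved as UTF-8.
word = "Résumé"

byte_count   = word.unpack("C*").length  # raw bytes in the string
char_count   = word.unpack("U*").length  # UTF-8 code points in the string
prefix_bytes = "Ré".unpack("C*").length  # bytes that precede the "s"

puts byte_count    # 8
puts char_count    # 6
puts prefix_bytes  # 3
```

If the same literal were stored in a single-byte encoding, every letter would be one byte and the prefix would be 2 bytes long.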


How about using this jindex function I wrote, which corresponds to the other methods in the jcode library:

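require 'jcode'    # provides String#jlength (Ruby 1.8 only)
require 'English'  # provides $PREMATCH as an alias for $`
$KCODE = 'u'       # make split(//) and jlength treat strings as UTF-8

class String
  # Character-based slice: split into characters, slice, rejoin.
  def jslice *args
    split(//)[*args].join rescue ""
  end

  # Character-based index of the first match at or after `start`.
  def jindex match, start=0
    if match.is_a? String
      match = Regexp.new(Regexp.escape(match))
    end
    if self.jslice(start..-1) =~ match
      # Count the characters before the match, not the bytes.
      $PREMATCH.jlength + start
    else
      nil
    end
  end
end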
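The same idea (count the characters in the text before the match) can also be sketched without jcode by decoding the pre-match with unpack("U*"), which returns one integer per UTF-8 code point. The helper names char_index and snippet are hypothetical, and the code assumes the text is valid UTF-8:

```ruby
# Hypothetical helpers, not part of any library. Assumes valid UTF-8 input.

# Character offset of the first match, or nil if there is none.
def char_index(text, pattern)
  m = pattern.match(text)
  return nil unless m
  # One array element per code point in the text before the match.
  m.pre_match.unpack("U*").length
end

# Up to `radius` characters on each side of the first match, or nil.
def snippet(text, pattern, radius)
  m = pattern.match(text)
  return nil unless m
  chars  = text.unpack("U*")                       # whole text as code points
  start  = m.pre_match.unpack("U*").length         # first character of match
  finish = start + m[0].unpack("U*").length - 1    # last character of match
  from   = [start - radius, 0].max                 # clamp to start of text
  to     = [finish + radius, chars.length - 1].min # clamp to end of text
  chars[from..to].pack("U*")                       # re-encode as UTF-8
end

puts char_index("Résumé", /s/u)  # 2
puts char_index("Resume", /s/u)  # 2
puts snippet("Résumé", /s/u, 1)  # ésu
```

Because the offsets are computed on arrays of code points, the slice never cuts a multi-byte character in half, whether the regex engine reports byte or character positions.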