Is Ruby 1.9.2's new regex engine (Oniguruma) very slow?

I recently migrated from Rails 2 to Rails 3 and thus got the new regular expression engine that comes by default in Ruby 1.9.2.

I had heard a lot of good things about this regex engine. However, a portion of my app that relies heavily on regex has become very slow.

This is what I want to achieve: I need to check a string for some specific keywords. Once I hit a keyword, I need to modify the string to add a link to some site based on the keyword that matched. A string might contain more than one such keyword, and I need to check the string for thousands of keywords. All this needs to happen in a matter of minutes, and everything was working fine with the logic in Ruby 1.8.7.

Earlier it used to finish in a matter of seconds, and now it takes hours. I ran both side by side today: Ruby 1.8.7 finished in 2 seconds, whereas 1.9.2 took 1.5 hours! There is obviously something wrong.

My regular expressions look like:

/.*\b(sometext)\b/i

Questions:

  1. Do I need to phrase my regex differently, or is there some other trick to speed up the matching process in Ruby 1.9.2?
  2. Worst case, is there a way to use the Ruby 1.8.7 regex engine with Ruby 1.9.2?


You can drop the .* from your regex completely. All it does is match the entire string and then backtrack until your search string is found. Remove it and see if it's still as slow.
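One way to check this on your own data is a quick timing comparison. This is only a sketch: TEXT is a made-up stand-in for your input, "sometext" is the placeholder keyword from the question, and the timings will depend on your strings and Ruby version.

```ruby
require 'benchmark'

# Hypothetical input: a long single-line string with the keyword near the end.
TEXT = ("lorem ipsum dolor sit amet " * 200) + "sometext at the end"

WITH_PREFIX    = /.*\b(sometext)\b/i   # pattern from the question
WITHOUT_PREFIX = /\b(sometext)\b/i     # same pattern with the leading .* removed

# Both patterns capture the same keyword in group 1.
puts WITH_PREFIX.match(TEXT)[1]
puts WITHOUT_PREFIX.match(TEXT)[1]

Benchmark.bm(12) do |x|
  x.report("with .*")    { 1_000.times { TEXT =~ WITH_PREFIX } }
  x.report("without .*") { 1_000.times { TEXT =~ WITHOUT_PREFIX } }
end
```

Note that the two patterns differ in where the overall match starts (the .* version starts at the beginning of the line), so this only applies if you use the captured group rather than the whole match.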


It may not be the regex engine, but the fact that 1.9.x has String encoding built-in and will default to UTF-8 (I think). Try forcing the encoding on your input string to US-ASCII.

source_string.force_encoding("US-ASCII")

Performing thousands of regex matches on UTF-8 data, which is comparatively expensive to process, is likely to be a great deal slower.

This may or may not work. I haven't tested it, but it springs to mind, before the regex engine does, when we're talking about speed differences on this magnitude.
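To test the encoding theory rather than guess, something like the following could be used. It is a sketch under assumptions: SOURCE_STRING is invented sample data, and the ascii_only? guard is my addition so that a string with non-ASCII bytes is not relabelled as US-ASCII.

```ruby
require 'benchmark'

# Hypothetical sample input; substitute one of your real strings.
SOURCE_STRING = "plain ascii text with sometext inside " * 1_000
PATTERN = /\b(sometext)\b/i

UTF8_COPY = SOURCE_STRING.dup.force_encoding("UTF-8")

# Relabel as US-ASCII only when every byte is 7-bit; otherwise keep the original encoding.
ASCII_COPY =
  if SOURCE_STRING.ascii_only?
    SOURCE_STRING.dup.force_encoding("US-ASCII")
  else
    SOURCE_STRING.dup
  end

puts "encodings: #{UTF8_COPY.encoding} vs #{ASCII_COPY.encoding}"

Benchmark.bm(10) do |x|
  x.report("UTF-8")    { 200.times { UTF8_COPY.scan(PATTERN) } }
  x.report("US-ASCII") { 200.times { ASCII_COPY.scan(PATTERN) } }
end
```

If the two rows come out about the same, encoding is probably not the bottleneck for your data.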

How big are your input strings? o_O

I'd also profile your algorithm to try and identify where the bottlenecks are. You may be surprised.


Just as a recommendation for dealing with multiple regex lookups:

Look into the Regexp.union method, or use the regex '|' operator to OR your expressions into groups. The engine is fast, but only you know how best to look for things, so it relies on you to set it up for success.

For example, you can search for multiple targets various ways:

if string[/\btarget1\b/] || string[/\btarget2\b/] ...

or

if string[/\b(?:target1|target2)\b/] ...

You can build that or'd list of targets however you want, but it will be faster than separate searches.

Use Ruby's Benchmark module to prove your work. :-)
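As a rough sketch of how that could look for the keyword-linking task in the question: the LINKS table, its URLs, and the linkify helper below are all hypothetical names made up for illustration. Regexp.union escapes each keyword and joins them with '|', and a single gsub pass then replaces every hit.

```ruby
require 'benchmark'

# Hypothetical keyword => URL table; the real one would hold thousands of entries.
LINKS = {
  "ruby"  => "http://example.com/ruby",
  "rails" => "http://example.com/rails",
  "regex" => "http://example.com/regex"
}

# One combined, case-insensitive pattern with word boundaries around the alternation.
COMBINED = /\b(?:#{Regexp.union(LINKS.keys).source})\b/i

# One precompiled pattern per keyword, for comparison.
SEPARATE = LINKS.keys.map { |k| /\b#{Regexp.escape(k)}\b/i }

# Wrap every matched keyword in a link, looking the URL up by the lowercased hit.
def linkify(text, pattern, links)
  text.gsub(pattern) { |hit| %(<a href="#{links[hit.downcase]}">#{hit}</a>) }
end

SAMPLE = "Ruby and Rails both lean on regex."
puts linkify(SAMPLE, COMBINED, LINKS)

Benchmark.bm(10) do |x|
  x.report("separate") { 1_000.times { SEPARATE.each { |re| SAMPLE =~ re } } }
  x.report("combined") { 1_000.times { SAMPLE =~ COMBINED } }
end
```

The benchmark here only compares matching, not replacement, and with three keywords the difference will be small; the gap is expected to matter more as the keyword list grows.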

And, sometimes it is useful to think outside the Ruby box. Consider using a database to do your searches. Set up your data correctly and a DBM can be incredibly fast.
