Is ruby 1.9.2's new regex engine (Oniguruma) very slow?

2023-04-06 07:36

I recently migrated from Rails 2 to Rails 3 and thus got the new regular expression engine that comes by default in Ruby 1.9.2.

I had heard a lot of good things about this regex engine. However, a portion of my app that relies heavily on regex has become very slow.

This is what I want to achieve: I need to check a string for some specific keywords. Once I hit a keyword, I need to modify the string to add a link to some site based on the keyword that matched. A string might contain more than one such keyword, and I need to check the string for thousands of keywords. All this needs to happen in a matter of minutes, and everything was working fine with the logic in Ruby 1.8.7.

Earlier it used to get done in a matter of seconds, and now it takes hours. Today I ran both side by side: Ruby 1.8.7 got done in 2 seconds, whereas 1.9.2 took 1.5 hours! There is obviously something wrong.

My regular expressions look like:

/.*\b(sometext)\b/i

Questions:

Do I need to phrase my regex differently, or is there some other trick to speed up the matching process in Ruby 1.9.2?
Worst case, is there a way to use the ruby 1.8.7 regex engine with ruby 1.9.2?

You can drop the .* from your regex completely. All it does is match the entire string and then backtrack until your search string is found. Remove it and see if it's still as slow.
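As a rough illustration of that suggestion, here is a minimal sketch comparing the two pattern shapes. The keyword "ruby" and the input text are made up for the example. Note one behavioural difference: with the leading .* the capture lands on the last occurrence in the line, without it on the first, which only matters if you rely on match positions.

```ruby
# Hypothetical input: the same phrase repeated, with no newlines.
text = "Learn Ruby and other languages " * 1000

with_prefix    = /.*\b(ruby)\b/i  # original shape: greedy .* runs to the end, then backtracks
without_prefix = /\b(ruby)\b/i    # same keyword test without the leading .*

m1 = text.match(with_prefix)
m2 = text.match(without_prefix)

# Both capture the same keyword text...
puts m1[1]
puts m2[1]

# ...but at different offsets: last occurrence vs. first occurrence.
puts m1.begin(1)
puts m2.begin(1)
```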

It may not be the regex engine, but the fact that 1.9.x has String encoding built-in and will default to UTF-8 (I think). Try forcing the encoding on your input string to US-ASCII.

source_string.force_encoding("US-ASCII")

Running thousands of regexes against UTF-8 strings, where matching is comparatively expensive, is likely to be a great deal slower.

This may or may not work. I haven't tested it, but it springs to mind before the regex engine does when we're talking about speed differences of this magnitude.
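A small sketch of what that could look like, assuming the input really is plain ASCII. The sample string is hypothetical. force_encoding only relabels the bytes in place and does not transcode them, so the sketch guards with ascii_only? to avoid producing a string that is invalid in its new encoding.

```ruby
# Hypothetical input; .dup gives a mutable copy of the literal.
source_string = "Check the Ruby docs for details".dup

puts source_string.encoding

# Relabel only when every byte is 7-bit ASCII; otherwise leave the string alone.
source_string.force_encoding("US-ASCII") if source_string.ascii_only?

puts source_string.encoding
puts source_string.valid_encoding?

# An ASCII-only pattern still matches the relabelled string.
puts(source_string =~ /\bruby\b/i ? "matched" : "no match")
```

Whether this actually changes the timing is something to measure rather than assume.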

How big are your input strings? o_O

I'd also profile your algorithm to try and identify where the bottlenecks are. You may be surprised.

Just as a recommendation for dealing with multiple regex lookups:

Look into the Regexp.union method, or use the regex '|' operator to OR your expressions together into groups. The engine is fast, but only you know how best to look for things, so it relies on you to set it up for success.

For example, you can search for multiple targets various ways:

if string[/\btarget1\b/] || string[/\btarget2\b/] ...

if string[/\b(?:target1|target2)\b/] ...

You can build that or'd list of targets however you want, but it will be faster than separate searches.
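One way to build the combined pattern and add the links in a single pass might look like the sketch below. The keyword-to-URL table and the example.com addresses are invented for illustration; the real list would hold thousands of entries. Regexp.union escapes each string and joins them with '|', and the block form of gsub lets the replacement depend on which keyword matched.

```ruby
# Hypothetical keyword => URL table (keys kept lowercase for lookup).
links = {
  "ruby"  => "http://example.com/ruby",
  "rails" => "http://example.com/rails",
}

# Build the alternation once, wrap it in word boundaries, and reuse it.
keyword_re = /\b(?:#{Regexp.union(links.keys).source})\b/i

# Replace every matched keyword with a link, keeping the original casing.
add_links = lambda do |text|
  text.gsub(keyword_re) do |word|
    %(<a href="#{links[word.downcase]}">#{word}</a>)
  end
end

puts add_links.call("Ruby on Rails is not the same as rubyist.")
```

Because the pattern is compiled once and the string is scanned once, the cost no longer grows with one full scan per keyword.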

Use Ruby's Benchmark module to prove your work. :-)
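A minimal Benchmark sketch comparing separate searches with one combined pattern could look like this. The keywords, the input text and the iteration count are all made up, so the timings it prints say nothing about your real data.

```ruby
require "benchmark"

# Hypothetical data: 200 invented keywords and one input string.
keywords = (1..200).map { |i| "target#{i}" }
text = ("filler words " * 200) + "target150 trailing text"

# One regex per keyword vs. a single alternation of all keywords.
separate = keywords.map { |k| /\b#{Regexp.escape(k)}\b/i }
combined = /\b(?:#{Regexp.union(keywords).source})\b/i

Benchmark.bm(10) do |x|
  x.report("separate:") { 100.times { separate.any? { |re| text =~ re } } }
  x.report("combined:") { 100.times { text =~ combined } }
end
```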

And, sometimes it is useful to think outside the Ruby box. Consider using a database to do your searches. Set up your data correctly and a DBM can be incredibly fast.
