
ruby regex hangs

I wrote a Ruby script to process a large number of documents, and I use the following regex to extract URIs from a document's string representation:

#Taken from: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
URI_REGEX = /
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:                # URL protocol and colon
    (?:
      \/{1,3}                        # 1-3 slashes
      |                             #   or
      [a-z0-9%]                     # Single letter or digit or '%'
    )
    |                           #   or
    www\d{0,3}[.]               # "www.", "www1.", "www2." … "www999."
    |                           #   or
    [a-z0-9.\-]+[.][a-z]{2,4}\/  # looks like domain name followed by a slash
  )
  (?:                           # One or more:
    [^\s()<>]+                      # Run of non-space, non-()<>
    |                               #   or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
  )+
  (?:                           # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\)  # balanced parens, up to 2 levels
    |                                   #   or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’]        # not a space or one of these punct chars
  )
)/xi

It works pretty well for 99.9 percent of all documents, but it always hangs my script when it encounters the following token in one of the documents: token = "synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"

I am using the standard Ruby regexp operator, token =~ URI_REGEX, and I don't get any exception or error message.

First I tried to solve the problem by wrapping the regex evaluation in a Timeout::timeout block, but this degrades performance too much.
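
Roughly, the workaround looked like this (a minimal sketch; the one-second limit and the method name are just illustrative):

require 'timeout'

# Sketch of the Timeout-based workaround: abort any single match that runs too long.
def match_uri(token)
  Timeout.timeout(1) { token =~ URI_REGEX }
rescue Timeout::Error
  nil   # treat tokens that blow the time budget as "no URI found"
end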

Any other ideas on how to solve this problem?


Your problem is catastrophic backtracking. I just loaded your regex and your test string into RegexBuddy, and it gave up after 1,000,000 iterations of the regex engine (and from the looks of it, it would have gone on for many millions more had it not aborted).

The problem arises because parts of your text can be matched by several different parts of your regex (which is horribly complicated and painful to read); it seems that the "One or more:" part and the "End with:" part fight over the same characters when no overall match is possible, trying out millions of permutations that all fail.
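
Here is the textbook demonstration of the effect in isolation (nothing to do with your URL regex; it just shows how nested quantifiers explode on a near-miss input, and how long it takes depends on your Ruby version):

# Classic catastrophic-backtracking demo (illustrative only): the nested
# quantifiers let the engine split the run of 'a's in exponentially many ways
# before it finally concedes that the trailing 'b' makes the match impossible.
"a" * 30 + "b" =~ /\A(a+)+\z/   # can effectively hang the regex engine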

It's difficult to suggest a solution without knowing what the rules for matching a URI are (which I don't). All this balancing of parentheses suggests to me that regexes may not be the right tool for the job. Maybe you could break down the problem. First use a simple regex to find everything that looks remotely like a URI, then validate that in a second step (isn't there a URI parser for Ruby of some sort?).
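
A rough sketch of that two-step idea (the coarse candidate regex and the helper name are just illustrations, not something tuned for your corpus):

require 'uri'

# Step 1: a deliberately crude pre-filter that cannot backtrack catastrophically.
# Step 2: let the stdlib URI parser decide which candidates are real URIs.
CANDIDATE_REGEX = %r{(?:[a-z][\w-]+://|www\.)[^\s<>"]+}i

def extract_uris(text)
  text.scan(CANDIDATE_REGEX).select do |candidate|
    begin
      uri = URI.parse(candidate)
      uri.host || uri.opaque   # keep only candidates the parser accepts
    rescue URI::InvalidURIError
      false
    end
  end
end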

Another thing you might be able to do is to prevent the regex engine from backtracking by using atomic groups. If you can change some (?:...) groups into (?>...) groups, that would allow the regex to fail faster by disallowing backtracking into those groups. However, that might change the match and make it fail on occasions where backtracking is necessary to achieve a match at all - so that's not always an option.
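
Just to illustrate the syntax (a sketch of the idea, not a verified fix for your full pattern), Ruby writes an atomic group as (?>...); applied to a simplified version of the "run of non-space or parenthesized chunk" part it would look like this:

# Sketch of the atomic-group idea on a simplified fragment: once (?>...) has
# matched, the engine may not backtrack into it again, so a failing overall
# match fails quickly instead of retrying every possible split point.
RUN_FRAGMENT = /
  (?>                      # atomic group: no backtracking into it
    [^\s()<>]+             # run of non-space, non-paren chars
    |
    \( [^\s()<>]+ \)       # or one parenthesized chunk
  )+
/x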


Why reinvent the wheel?

require 'uri'
uri_list = URI.extract("Text containing URIs.")


URI.extract("Text containing URIs.") is the best solution if you only need the URIs.

I finally used pat = URI::Parser.new.make_regexp('http') to get the built-in URI-parsing regexp, and I use it in match = str.match(pat, start_pos) to iteratively parse the input text URI by URI. I am doing this because I also need the URI positions in the text, and the returned match object gives me that information via match.begin(0).
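
A sketch of that loop (the method name and the sample string are just for illustration):

require 'uri'

# Iterate over the text URI by URI, keeping the start offset of each match.
def each_uri_with_offset(str)
  pat = URI::Parser.new.make_regexp('http')
  pos = 0
  while (match = str.match(pat, pos))
    yield match[0], match.begin(0)
    pos = match.end(0)              # resume scanning right after this match
  end
end

each_uri_with_offset("see http://example.com and http://example.org/x") do |uri, offset|
  puts "#{offset}: #{uri}"
end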
