
Slow parsing with regular expression

I have the following regular expression for data validation:

lexer = /(?:
      (.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
      (.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
      (?:\s+([A-Za-z][A-Za-z0-9]{2}(?=\s))|(\s+))\s*
      (Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\s+\d{10}|\s+)\s*
      (\d{6})\s*
      (.*)(?=((?:\d{2}\/){2}\d{4}))\s*
      ((?:\d{2}\/){2}\d{4})\s*
      (\S+)
    )/x

The problem is that I have to iterate through a file of about 10,000 lines on average, validating each line against this regular expression, and parsing is slow.

  filename = File.new(@file, "r")
  filename.each_line.with_index do |line, index|
    next if index < INFO_AT + 1

    lexer = /(?:
      (.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
      (.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
      (?:\s+([A-Za-z][A-Za-z0-9]{2}(?=\s))|(\s+))\s*
      (Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\s+\d{10}|\s+)\s*
      (\d{6})\s*
      (.*)(?=((?:\d{2}\/){2}\d{4}))\s*
      ((?:\d{2}\/){2}\d{4})\s*
      (\S+)
    )/x

    m = lexer.match(line)
    begin
      if (m) then ...

Edit: Here you can find some of the lines that I need to parse: File

Edit II: @Mike R

I'm parsing a file that contains 25 columns per line, and each column might have its own validation rule: it could be whitespace or a full character set.

  • That validation is required since I have to drop any line that doesn't match that kind of pattern.
  • Might not be necessary
  • It's necessary

I don't believe the expression is badly constructed; the lookahead is used deliberately. Maybe the problem is in the part where I repeated the code (I just didn't remember the backreferences \1...\n, if that's what you mean!). I also believe that catastrophic backtracking is happening here.

If you see the file, maybe you'll understand why I'm doing this! Take the first column as an example: I have to match a "Part Number" and I don't have any rule for how to do this. Examples:

  • 123456789
  • 1 555 989
  • 0123456789123456789

Neither a simple \S+ nor (\S+\s){1,} solves this problem, because I wouldn't be guaranteeing data integrity.

Thanks! Any improvements or suggestions?

~ Eder Quiñones


  1. The regexp doesn't change in the loop, so pull the lexer = assignment up and out of the loop. ("Loop invariant code motion.")

  2. I'm not sure whether this will be faster or slower, but the repeated subexpressions could be collapsed into a single outer repeating group.

  3. So this is obvious, I'm sure, but it might make sense to write a real parser. Ruby has an advanced PEG parser generator called Treetop. The traditional approaches are either Lex + Yacc (such as flex + bison) or just a C program implementing a scanner and a recursive-descent parser. Old-school, yes, but easily written and kind of a fun change from yet another CRUD program.

  4. I agree with the other responders that the expression is unnecessarily redundant and complicated, and should probably be broken up into at least two separate, smaller ones. Two expressions could be interesting because they could be anchored at opposite ends of the line.
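Suggestion 1 ("loop invariant code motion") can be sketched as follows. `LEXER` here is a tiny stand-in for the full pattern in the question, and `parse_lines` is a hypothetical helper; the point is only that the regex is compiled once, before the loop.

```ruby
# Compile the pattern once, outside the loop, instead of on every line.
LEXER = /(\d{6})\s+((?:\d{2}\/){2}\d{4})/  # simplified stand-in pattern

def parse_lines(lines)
  lines.filter_map do |line|  # one compiled regex, reused for each line
    m = LEXER.match(line)
    [m[1], m[2]] if m
  end
end

rows = parse_lines(["x 123456 01/02/2003 y", "garbage"])
# => [["123456", "01/02/2003"]]
```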


You're probably getting catastrophic backtracking.

Namely, though not only, because of your greedy \S+\s* sequences. They're also likely not validating what you intend, since:

(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*

would match 'abcd' (after a painful amount of backtracking).
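That claim is easy to verify in irb: the engine backtracks until each greedy group is left holding a single character.

```ruby
# Four greedy (\S+) groups happily carve up a single token after backtracking.
m = /(\S+)\s*(\S+)\s*(\S+)\s*(\S+)/.match("abcd")
m.captures  # => ["a", "b", "c", "d"]
```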

Also, as pointed out in the other answer, move the regex compilation out of your loop.


Your file is a format with fixed-width fields. Ruby has a string method called unpack that is specifically for parsing this type of file.

field_widths = [19,41,14,11,11,11,11,11,11,11] # etc.
field_pattern = "A#{field_widths.join('A')}"

Then in your line loop:

row = line.unpack(field_pattern)

Now you have an array (row) with the contents of each field. You can then apply a regex to each one for validation. This is faster, more manageable, and also allows for field-specific error messages.
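A hedged sketch of the unpack approach. The widths and the sample line below are made up for illustration; the real widths must come from the actual file layout.

```ruby
# Hypothetical 3-column layout: 10-char part number, 5-char code, 6-digit qty.
field_widths  = [10, 5, 6]
field_pattern = "A#{field_widths.join('A')}"  # => "A10A5A6"

line = "123456789 ZROA 000123"
row  = line.unpack(field_pattern)             # "A" trims trailing spaces
# => ["123456789", "ZROA", "000123"]

valid = row[2].match?(/\A\d{6}\z/)            # per-field validation is now trivial
```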


What are you actually looking for in the parsing?

As far as I can tell, there are only a few significant elements that you're looking for in the line.

  • You're looking for an alphanumeric code starting with Z, H, or F:
    (Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*

  • You're looking for a 6 digit number
    (\d{6})\s*

  • You're looking for a date
    (?=((?:\d{2}\/){2}\d{4}))\s*
    ((?:\d{2}\/){2}\d{4})\s*

(Note that the last date expression is an example of how badly constructed this is. It has a lookahead for a date, immediately followed by a match of that exact same date; the lookahead is therefore meaningless. See the other answer on catastrophic backtracking, which is undoubtedly happening here.)
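A quick check of that observation (the sample string is made up): dropping the lookahead in favour of a lazy (.*?) captures the same pieces when the line contains a single date. Note the greedy original would latch onto the *last* date if there were several.

```ruby
date = /(?:\d{2}\/){2}\d{4}/
with_lookahead = /(.*)(?=(#{date}))\s*(#{date})/  # the original construction
simplified     = /(.*?)(#{date})/                 # lookahead removed

s = "junk text 01/02/2003 rest"
a = with_lookahead.match(s)
b = simplified.match(s)
# a[1] == b[1] == "junk text "; a[3] == b[2] == "01/02/2003"
```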

Pretty much everything else in that expression could match anything or nothing, so I find it hard to believe it expresses rules of any meaningful form. Real data just doesn't vary that much, or it varies with some precise rule that can be expressed in one (or more) regexes.

So, junk the expression completely and stop trying to do it all at once. Work on individual pieces (splitting the line on whitespace as necessary) and break this up into several small regexes that each match exactly what you need to verify.
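The split-and-validate idea might look like the sketch below. The per-column rules here are illustrative assumptions built from the fragments visible in the question, not the real spec.

```ruby
# Small, anchored regexes, one per rule we actually care about.
CODE = /\AZ(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)\z/  # hypothetical code rule
SIX  = /\A\d{6}\z/                                    # the 6-digit number
DATE = /\A(?:\d{2}\/){2}\d{4}\z/                      # dd/mm/yyyy date

def valid_row?(line)
  cols = line.split(/\s+/)
  return false if cols.size < 3
  # Each rule is checked independently, so a failure is easy to report.
  cols.any? { |c| c.match?(CODE) } &&
    cols.any? { |c| c.match?(SIX) } &&
    cols.any? { |c| c.match?(DATE) }
end

valid_row?("PN123 ZROA 000123 01/02/2003")  # => true
valid_row?("no match here")                  # => false
```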
