
Slow parsing with regular expression

I have the following regular expression for data validation:

lexer = /(?:
      (.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
      (.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
      (?:\s+([A-Za-z][A-Za-z0-9]{2}(?=\s))|(\s+))\s*
      (Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\s+\d{10}|\s+)\s*
      (\d{6})\s*
      (.*)(?=((?:\d{2}\/){2}\d{4}))\s*
      ((?:\d{2}\/){2}\d{4})\s*
      (\S+)
    )/x

The problem is that I have to iterate through a file of about 10,000 lines on average, validating each line against this regular expression, and parsing is slow.

  filename = File.new(@file, "r")
  filename.each_line.with_index do |line, index|
    next if index < INFO_AT + 1

    lexer = /(?:
      (.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
      (.{18}|(?:.*)(?=\s\S{2,})|(?:[^\s+]\s){1,})\s*
      (?:\s+([A-Za-z][A-Za-z0-9]{2}(?=\s))|(\s+))\s*
      (Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\S+)\s*
      (\s+\d{10}|\s+)\s*
      (\d{6})\s*
      (.*)(?=((?:\d{2}\/){2}\d{4}))\s*
      ((?:\d{2}\/){2}\d{4})\s*
      (\S+)
    )/x

    m = lexer.match(line)
    begin
      if (m) then ...

Edit: Here you can find some of the lines that I need to parse: File

Edit II: @Mike R

I'm parsing a file that contains 25 columns per line, and each column might have its own validation rule: it could be whitespace or a full character set.

  • That validation is required since I have to drop any line that doesn't match that kind of pattern.
  • Might not be necessary
  • It's necessary

I don't believe the expression is badly constructed; the lookahead is used deliberately. Maybe the problem is in the part where I repeated the code (I just didn't remember the backreferences \1...\n, if that's what you mean!). I also believe that catastrophic backtracking is happening here.

If you see the file, maybe you'll understand why I'm doing this! Take the first column as an example: I have to match a "Part Number" and I don't have any rule for how to do this. Examples:

  • 123456789
  • 1 555 989
  • 0123456789123456789

Neither a simple \S+ nor (\S+\s){1,} solves this problem, because I wouldn't be guaranteeing data integrity.

Thanks! Any improvements or suggestions?

~ Eder Quiñones


  1. The regexp doesn't change in the loop, so pull the lexer = assignment up and out of the loop. ("Loop invariant code motion.")

  2. I'm not sure whether this will be faster or slower, but the repeated subexpressions could be collapsed into a single outer repeating group.

  3. So this is obvious, I'm sure, but it might make sense to write a real parser. Ruby has an advanced PEG parser generator called Treetop. The traditional approaches are either Lex + Yacc (such as flex + bison) or just a C program implementing a scanner and a recursive-descent parser. Old-school, yes, but easily written and kind of a fun change from yet another CRUD program.

  4. I agree with the other responders that the expression is unnecessarily redundant and complicated, and should probably be broken up into at least two separate, smaller ones. Two expressions could be interesting because they could be anchored at opposite ends of the line.
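Suggestion 1 ("loop invariant code motion") can be sketched as follows. `LEXER` here is a tiny stand-in for the full pattern in the question, and `parse_lines` is a hypothetical helper; the point is only that the regex is compiled once, before the loop.

```ruby
# Compile the pattern once, outside the loop, instead of on every line.
LEXER = /(\d{6})\s+((?:\d{2}\/){2}\d{4})/  # simplified stand-in pattern

def parse_lines(lines)
  lines.filter_map do |line|  # one compiled regex, reused for each line
    m = LEXER.match(line)
    [m[1], m[2]] if m
  end
end

rows = parse_lines(["x 123456 01/02/2003 y", "garbage"])
# => [["123456", "01/02/2003"]]
```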


You're probably getting catastrophic backtracking.

Namely, though not only, because of your greedy \S+\s* sequences. They're also likely not validating what you intend, since:

(\S+)\s*
(\S+)\s*
(\S+)\s*
(\S+)\s*

would match 'abcd' (after a painful amount of backtracking).
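That claim is easy to verify in irb: the engine backtracks until each greedy group is left holding a single character.

```ruby
# Four greedy (\S+) groups happily carve up a single token after backtracking.
m = /(\S+)\s*(\S+)\s*(\S+)\s*(\S+)/.match("abcd")
m.captures  # => ["a", "b", "c", "d"]
```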

Also, as pointed out in the other answer, move the regex compilation out of your loop.


Your file is a format with fixed-width fields. Ruby has a string method called unpack that is specifically for parsing this type of file.

field_widths = [19,41,14,11,11,11,11,11,11,11] # etc.
field_pattern = "A#{field_widths.join('A')}"

Then in your line loop:

row = line.unpack(field_pattern)

Now you have an array (row) with the contents of each field. You can then apply a regex to each one for validation. This is faster, more manageable, and also allows for field-specific error messages.
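A hedged sketch of the unpack approach. The widths and the sample line below are made up for illustration; the real widths must come from the actual file layout.

```ruby
# Hypothetical 3-column layout: 10-char part number, 5-char code, 6-digit qty.
field_widths  = [10, 5, 6]
field_pattern = "A#{field_widths.join('A')}"  # => "A10A5A6"

line = "123456789 ZROA 000123"
row  = line.unpack(field_pattern)             # "A" trims trailing spaces
# => ["123456789", "ZROA", "000123"]

valid = row[2].match?(/\A\d{6}\z/)            # per-field validation is now trivial
```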


What are you actually looking for in the parsing?

As far as I can tell, there are only a few significant elements that you're looking for in the line.

  • You're looking for an alphanumeric code starting with Z, H, or F:
    (Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*

  • You're looking for a 6 digit number
    (\d{6})\s*

  • You're looking for a date
    (?=((?:\d{2}\/){2}\d{4}))\s*
    ((?:\d{2}\/){2}\d{4})\s*

(Note that the last date expression is an example of how badly constructed this is. It has a lookahead for a date, immediately followed by a match of that exact same date; the lookahead is therefore meaningless. See the other answer on catastrophic backtracking, which is undoubtedly happening here.)
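A quick check of that observation (the sample string is made up): dropping the lookahead in favour of a lazy (.*?) captures the same pieces when the line contains a single date. Note the greedy original would latch onto the *last* date if there were several.

```ruby
date = /(?:\d{2}\/){2}\d{4}/
with_lookahead = /(.*)(?=(#{date}))\s*(#{date})/  # the original construction
simplified     = /(.*?)(#{date})/                 # lookahead removed

s = "junk text 01/02/2003 rest"
a = with_lookahead.match(s)
b = simplified.match(s)
# a[1] == b[1] == "junk text "; a[3] == b[2] == "01/02/2003"
```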

Pretty much everything else in that expression could match anything or nothing, so I find it hard to believe it expresses rules of any meaningful form. Real data just doesn't vary that much, or it varies with some precise rule that can be expressed in one (or more) regexes.

So, junk the expression completely and stop trying to do it all at once. Work on individual pieces (splitting the line on whitespace as necessary) and break this up into several small regexes that each match exactly what you need to verify.
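The split-and-validate idea might look like the sketch below. The per-column rules here are illustrative assumptions built from the fragments visible in the question, not the real spec.

```ruby
# Small, anchored regexes, one per rule we actually care about.
CODE = /\AZ(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)\z/  # hypothetical code rule
SIX  = /\A\d{6}\z/                                    # the 6-digit number
DATE = /\A(?:\d{2}\/){2}\d{4}\z/                      # dd/mm/yyyy date

def valid_row?(line)
  cols = line.split(/\s+/)
  return false if cols.size < 3
  # Each rule is checked independently, so a failure is easy to report.
  cols.any? { |c| c.match?(CODE) } &&
    cols.any? { |c| c.match?(SIX) } &&
    cols.any? { |c| c.match?(DATE) }
end

valid_row?("PN123 ZROA 000123 01/02/2003")  # => true
valid_row?("no match here")                  # => false
```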
