Regex implementation with event driven matches?

2023-01-27 21:31

This may sound a little odd, but it would be extremely useful to me. Are there any regex implementations (any language, but preferably Java, JavaScript, C, or C++) that use an event-based model for matches?

I would like to be able to register a bunch of different regular expressions I am looking for in a string via an event-based model, feed the string through the regex engine, and just have the events fired off correctly. Does anything like this exist?

I realize this is bordering on the territory of a heavy duty lexer/parser, but I would prefer to stay away from that if at all possible, as my search expressions would need to be dynamic (completely).

Thanks

This is very easy to do in Perl regular expressions. All you do is insert your event callouts at the appropriate point in the pattern in the most straightforward manner imaginable.

First, imagine a pattern for pulling out decimal numbers from a string:

my $rx0 = qr/[+-]?(?:\d+(?:\.\d*)?|\.\d+)/;

Let’s expand that out so we can insert our callouts:

my $rx1 = qr{
    [+-] ?
    (?: \d+
        (?: \. \d* ) ?
      |
        \. \d+
    )
}x;

For callouts, I’ll just print some debugging, but you could do anything you want:

my $rx2 = qr{
    (?: [+-]                (?{ say "\tleading sign"                })
    ) ?
    (?: \d+                 (?{ say "\tinteger part"                })
        (?: \.              (?{ say "\tinternal decimal point"      })
            \d*             (?{ say "\toptional fractional part"    })
        ) ?
      |
        \.                  (?{ say "\tleading decimal point"       })
        \d+                 (?{ say "\trequired fractional part"    })
    )                       (?{ say "\tsuccess"                     })
}x;

Here’s the whole demo:

use 5.010;
use strict;

use utf8;

my $rx0 = qr/[+-]?(?:\d+(?:\.\d*)?|\.\d+)/;

my $rx1 = qr{
    [+-] ?
    (?: \d+
        (?: \. \d* ) ?
      |
        \. \d+
    )
}x;

my $rx2 = qr{
    (?: [+-]                (?{ say "\tleading sign"                })
    ) ?
    (?: \d+                 (?{ say "\tinteger part"                })
        (?: \.              (?{ say "\tinternal decimal point"      })
            \d*             (?{ say "\toptional fractional part"    })
        ) ?
      |
        \.                  (?{ say "\tleading decimal point"       })
        \d+                 (?{ say "\trequired fractional part"    })
    )                       (?{ say "\tsuccess"                     })
}x;

my $string = <<'END_OF_STRING';

    The Earth’s temperature varies between
    -89.2°C and 57.8°C, with a mean of 14°C.

    There are .25 quarts in 1 gallon.

    +10°F is -12.2°C.

END_OF_STRING

while ($string =~ /$rx2/gp) {
    # say, not printf: the match is data, not a format string
    say "Number: ${^MATCH}";
}

which when run produces this:

        leading sign
        integer part
        internal decimal point
        optional fractional part
        success
Number: -89.2
        integer part
        internal decimal point
        optional fractional part
        success
Number: 57.8
        integer part
        success
Number: 14
        leading decimal point
        leading decimal point
        required fractional part
        success
Number: .25
        integer part
        success
Number: 1
        leading decimal point
        leading sign
        integer part
        success
Number: +10
        leading sign
        integer part
        internal decimal point
        optional fractional part
        success
Number: -12.2
        leading decimal point
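
If you are on Java rather than Perl, java.util.regex has no equivalent of (?{ ... }), so the closest analogue I can sketch is to give each part of the pattern a named group and report the groups that took part in each successful match. This is only an approximation: nothing fires while the engine is backtracking, so the stray "leading decimal point" lines in the output above would not appear. The class name, group names, and event table are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberParts {
    // Same decimal-number grammar as $rx2, with one named group per part.
    private static final Pattern NUMBER = Pattern.compile(
        "(?<sign>[+-])?"
        + "(?:(?<integer>\\d+)(?:(?<point>\\.)(?<frac>\\d*))?"
        + "|(?<leadpoint>\\.)(?<reqfrac>\\d+))");

    // Hypothetical mapping from group name to the event label to report.
    private static final String[][] EVENTS = {
        {"sign", "leading sign"},
        {"integer", "integer part"},
        {"point", "internal decimal point"},
        {"frac", "optional fractional part"},
        {"leadpoint", "leading decimal point"},
        {"reqfrac", "required fractional part"},
    };

    // Returns the events for every number found in text, in match order.
    static List<String> events(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = NUMBER.matcher(text);
        while (m.find()) {
            for (String[] event : EVENTS) {
                // group(name) is null when that group did not participate.
                if (m.group(event[0]) != null) {
                    out.add(event[1]);
                }
            }
            out.add("Number: " + m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        events("-89.2 and .25").forEach(System.out::println);
    }
}
```

The events come out grouped per match, after the fact, which is usually what you want for dispatching but is not a trace of the engine's work the way the Perl callouts are.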

You may want to arrange a more grammatical regular expression for maintainability. This also helps for when you want to make a recursive descent parser out of it. (Yes, of course you can do that: this is Perl, after all. :)

Look at the last solution in this answer for what I mean by grammatical regexes. I also have larger examples elsewhere here on SO.

But it sounds like you should look at the Regexp::Grammars module by Damian Conway, which was built for just this sort of thing. This question talks about it, and has a link to the module proper.

You might want to check out PIRE, a very fast automata-based regexp engine tuned to match zillions of lines of text against many regular expressions quickly. It's written in C++ and has some bindings.
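
If pulling in a native library is not an option, a much cruder way to get a single pass over the input in plain Java is to join the registered expressions into one alternation and wrap each in a generated named group, then dispatch on whichever group matched. This is a sketch under some loud assumptions, and it is not how PIRE works: the class and method names are invented, the generated group names p0, p1, ... must not clash with names in your own expressions, numbered backreferences inside registered expressions will be shifted, and at each position only the first registered expression that matches gets its event (matches never overlap).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiPatternScanner {
    private final List<String> sources = new ArrayList<>();
    private final List<Consumer<String>> handlers = new ArrayList<>();
    private Pattern combined; // rebuilt lazily after each registration

    // Registers an expression and the handler to call with each matched text.
    void register(String regex, Consumer<String> handler) {
        Pattern.compile(regex); // fail fast if the expression is invalid
        sources.add(regex);
        handlers.add(handler);
        combined = null;
    }

    // Scans the input once and fires one handler per match.
    void scan(String input) {
        if (sources.isEmpty()) {
            return;
        }
        if (combined == null) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < sources.size(); i++) {
                if (i > 0) {
                    sb.append('|');
                }
                // Wrap expression i in a generated named group "p<i>".
                sb.append("(?<p").append(i).append('>')
                  .append(sources.get(i)).append(')');
            }
            combined = Pattern.compile(sb.toString());
        }
        Matcher m = combined.matcher(input);
        while (m.find()) {
            for (int i = 0; i < handlers.size(); i++) {
                if (m.group("p" + i) != null) {
                    handlers.get(i).accept(m.group());
                    break; // exactly one alternative matched
                }
            }
        }
    }

    public static void main(String[] args) {
        MultiPatternScanner scanner = new MultiPatternScanner();
        scanner.register("\\d+", s -> System.out.println("number: " + s));
        scanner.register("[a-z]+", s -> System.out.println("word: " + s));
        scanner.scan("abc 42 de");
    }
}
```

Because the expressions are kept as strings and the combined pattern is rebuilt on demand, registrations can stay fully dynamic, which was the requirement in the question.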

It's really not something that's too hard to put together yourself if you can't find any existing library.

Something like this:

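import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexNotifier {
   // Listeners grouped by the pattern they are interested in.
   private final Map<Pattern, List<RegexListener>> listeners = new HashMap<Pattern, List<RegexListener>>();

   public synchronized void register(Pattern pattern, RegexListener listener) {
      List<RegexListener> list = listeners.get(pattern);
      if (list == null) {
         list = new ArrayList<RegexListener>();
         listeners.put(pattern, list);
      }
      list.add(listener);
   }

   // Synchronized as well, so another thread cannot register while we iterate.
   public synchronized void process(String input) {
      for (Entry<Pattern, List<RegexListener>> entry : listeners.entrySet()) {
         // find() fires once per occurrence anywhere in the input;
         // matches() would only succeed if the whole input matched.
         Matcher matcher = entry.getKey().matcher(input);
         while (matcher.find()) {
            for (RegexListener listener : entry.getValue()) {
               listener.stringMatched(matcher.group(), entry.getKey());
            }
         }
      }
   }
}

interface RegexListener {
   void stringMatched(String matched, Pattern pattern);
}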

The only shortcoming I see with this is that Pattern doesn't implement hashCode() and equals(), so two separately compiled instances of the same expression are treated as different keys and each gets its own listener list. Pattern.compile() does not cache compiled patterns, so reuse the same Pattern instance when you register several listeners for one expression.
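
One way around the missing equals() and hashCode() is to key the map on the expression's source text and flags rather than on the Pattern object itself. Here is a minimal sketch of such a key; PatternKey is a hypothetical wrapper of my own, not something from the JDK, and it treats two patterns as equal only when both pattern() and flags() agree.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.regex.Pattern;

final class PatternKey {
    final Pattern pattern;

    PatternKey(Pattern pattern) {
        this.pattern = pattern;
    }

    // Two keys are equal when the source text and the flags are the same.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PatternKey)) {
            return false;
        }
        Pattern other = ((PatternKey) o).pattern;
        return pattern.pattern().equals(other.pattern())
            && pattern.flags() == other.flags();
    }

    @Override
    public int hashCode() {
        return Objects.hash(pattern.pattern(), pattern.flags());
    }

    public static void main(String[] args) {
        Map<PatternKey, List<String>> listeners = new HashMap<>();
        // Two separately compiled instances of the same expression.
        Pattern first = Pattern.compile("\\d+");
        Pattern second = Pattern.compile("\\d+");
        listeners.computeIfAbsent(new PatternKey(first), k -> new ArrayList<>()).add("listener A");
        listeners.computeIfAbsent(new PatternKey(second), k -> new ArrayList<>()).add("listener B");
        // Both listeners end up under a single key.
        System.out.println(listeners.size());
    }
}
```

In RegexNotifier this would mean declaring the map as Map<PatternKey, List<RegexListener>> and wrapping the pattern in register(); the listener interface would not need to change.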
