Building a Regex Based Parser [closed]
Is it stupid to build a regex based parser?
Matching nested parens is exceedingly simple using modern patterns. Ignoring whitespace (that is, with the pattern compiled in extended /x mode), this sort of thing:
\( (?: [^()] *+ | (?0) )* \)
works for mainstream languages like Perl and PHP, plus anything that uses PCRE.
However, you really need grammatical regexes for a full parse, or you’ll go nuts. Don’t use a language whose regexes don’t support breaking regexes down into smaller units, or which don’t support proper debugging of their compilation and execution. Life’s too short for low-level hackery. Might as well go back to assembly language if you’re going to do that.
I’ve written about recursive patterns, grammatical patterns, and parsing quite a bit: for example, see here for parsing approaches and here for lexer approaches; also, the final solution here.
Also, Perl’s Regexp::Grammars module is especially useful in turning grammatical regexes into parsing structures.
So by all means, go for it. You’ll learn a lot that way.
For work? Yes. For learning? No.
The allure of parsing your own little languages with regular expressions cannot be overstated: most sysadmins could write a simple language parser entirely in Perl very quickly, but parsing the same language with lex/yacc would take most programmers a few hours.
And the Perl version would probably just about do the job. But as gpvos points out, using a regex backend for your parsing drastically reduces future enhancement options, and attempts to work around the limitations sometimes lead to some pretty awful code, when those general enhancements would be easy to handle with table-driven tools or hand-written recursive descent parsers.
If you know the language is always going to remain easily parse-able with regex, you might do the right thing by spending an hour to get the job done, rather than four or five re-learning lex and yacc enough to write a similar parser with stronger tools. But if the language is liable to grow or change much, using real parser generators will probably help in the long run.
It depends on what you want to parse, but IMO for most practical cases the answer is "No". Regexes are quite limited in the grammars they can recognize (the exact limits are set by the regex implementation, as everybody puts their own spice on it).
Since you stated in your comments that you're building a parser for VBScript, forget about regexes: you need to recognize a context-free grammar. Check GOLD Parser or ANTLR.
Often, regexes are used for the lexer (the recognizing of tokens), and something more powerful such as a recursive descent parser is used for recognizing the sequences of tokens, i.e., the actual parsing.
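The lexer half of that split can be sketched with Python's standard `re` module. This is only an illustration: the token names and the tiny token set below are assumptions made up for the example, not part of any language discussed in this thread.

```python
import re

# Hypothetical token set for a tiny arithmetic language (illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[-+*/()=]"),
    ("SKIP",   r"\s+"),
    ("ERROR",  r"."),          # any other single character
]
# One alternation of named groups; the first alternative that matches wins.
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(text):
    """Return a list of (kind, lexeme) pairs; whitespace is dropped."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "ERROR":
            raise SyntaxError(
                f"unexpected character {match.group()!r} at offset {match.start()}"
            )
        tokens.append((kind, match.group()))
    return tokens
```

The regex only decides where one token ends and the next begins. Deciding whether the token sequence is well-formed is left to a separate parser.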
For very simple languages, a regex could be enough, but you would be limiting yourself very much. For example, you cannot parse an arbitrarily nested expression like (1 + 2) * 3 - 4 using a plain (non-recursive) regex.
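For comparison, here is a minimal sketch of how a hand-written recursive descent parser copes with that expression, with a regex doing nothing but the tokenizing. The function name `evaluate` and the three-rule grammar are assumptions chosen for the example, and it evaluates the expression directly rather than building a tree.

```python
import re

def evaluate(text):
    """Evaluate +, -, *, / and parentheses over unsigned numbers."""
    # The regex does the lexing only. Note that findall silently skips
    # characters it does not recognise; a real lexer should report them.
    tokens = re.findall(r"\d+(?:\.\d+)?|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def expr():      # expr := term (('+' | '-') term)*
        value = term()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():      # term := factor (('*' | '/') factor)*
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():    # factor := NUMBER | '(' expr ')'
        tok = advance()
        if tok == "(":
            value = expr()   # recursion handles any nesting depth
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok[0].isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return float(tok)

    result = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return result

print(evaluate("(1 + 2) * 3 - 4"))  # 5.0
```

Operator precedence falls out of the grammar's layering (`expr` calls `term`, which calls `factor`), and nesting falls out of `factor` calling back into `expr`.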
Have a look at the GoldParser. It allows the use of regular expressions for finding the tokens.