
Detecting regular expression in content during parse

I am writing a simple parser for C. For fun, I ran it on source files in some other languages to see how C-like they are (out of laziness, I would rather not write a separate parser for each language if I can avoid it).

However the parser seems to break down for JavaScript if the code being parsed contains regular expressions...

Case 1: For example, while parsing the JavaScript code snippet,

var phone="(304)434-5454"
phone=phone.replace(/[\(\)-]/g, "") 
//Returns "3044345454" (removes "(", ")", and "-")

The '(', '[' etc get matched as starters of new scopes, which may never be closed.

Case 2: And, for the Perl code snippet,

 # Replace backslashes with two forward slashes
 # Any character can be used to delimit the regex
 $FILE_PATH =~ s@\\@//@g; 

The // gets matched as a comment...

How can I detect a regular expression within the content text of a "C-like" program-file?


It is impossible.

Take this, for example:

m =~ s/a/b/g;

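This could be either C or Perl. Read as C, it is the assignment m = ~s / a / b / g; read as Perl, it looks like a substitution.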

A minute's thought reveals that infinitely many Perl-style regular expressions are also syntactically valid C expressions.

Another example, which is an arithmetic expression in C and a match with + as the delimiter and the i flag in Perl:

m+foo *bar[index]+i

The best you can do is extremely vague guesswork. The difficulty stems from the fact that a regular expression is a sequence of characters that can be virtually anything.

You would do better to clean up your error handling. A parser should not "break down" when parentheses are missing or superfluous ones appear.
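As a rough illustration of that kind of error handling, here is a minimal Python sketch of a bracket matcher that records diagnostics and keeps going instead of giving up. The function name check_brackets and the message wording are made up for this example, and it deliberately knows nothing about strings, comments, or regular expressions.

```python
# Hypothetical helper: reports bracket problems instead of failing on them.
PAIRS = {")": "(", "]": "[", "}": "{"}


def check_brackets(src):
    """Return a list of (offset, message) diagnostics; an empty list means balanced."""
    stack = []      # open brackets seen so far, as (char, offset)
    problems = []
    for offset, ch in enumerate(src):
        if ch in "([{":
            stack.append((ch, offset))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                # Stray or mismatched closer: note it and carry on scanning.
                problems.append((offset, f"unexpected {ch!r}"))
    # Whatever is still open at end of input was never closed.
    problems.extend((off, f"unclosed {ch!r}") for ch, off in stack)
    return problems


if __name__ == "__main__":
    # A JavaScript regex literal containing a lone "(" confuses the matcher,
    # but the scan still finishes and returns a list of complaints.
    for offset, message in check_brackets('s.replace(/[(]/g, "")'):
        print(offset, message)
```

The point is only that unbalanced input produces a list of complaints rather than a crash; a real parser would attach such diagnostics to its own scope handling.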


Well, your token grammar has to take regex syntax into consideration. Classic parsers consist of two layers: something to tokenize the input, and then something to parse the grammar. The syntax of the language is generally expressed in terms of tokens, so the job of the tokenizer is to feed a stream of those to the parser. Generally the tokens themselves are defined by regular expressions, or more properly by one big regex of alternatives ORed together. At each character position in the input, one of the token regexes must match, or else the character is invalid.
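To make the tokenizer idea concrete, here is a minimal Python sketch of a "big regex of alternatives" lexer for JavaScript-like input. All names (TOKEN_SPEC, regex_allowed, tokenize) are invented for this example. It uses a common heuristic, not a rule from any language specification: a "/" starts a regex literal only when the previous significant token could not have ended an expression. The heuristic gets some cases wrong, such as a regex directly after the ")" of an if condition.

```python
import re

# Body of a JavaScript-style regex literal: escapes, [...] classes, other chars.
REGEX_LITERAL = re.compile(r"/(?:\\.|\[(?:\\.|[^\]\\\n])*\]|[^/\\\[\n])+/[A-Za-z]*")

# Alternatives are tried in order, so COMMENT wins over the "/" operator.
TOKEN_SPEC = [
    ("WS",      r"\s+"),
    ("COMMENT", r"//[^\n]*|/\*[\s\S]*?\*/"),
    ("STRING",  r'"(?:\\.|[^"\\\n])*"|' + r"'(?:\\.|[^'\\\n])*'"),
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("IDENT",   r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("PUNCT",   r"[{}()\[\];,.]"),
    ("OP",      r"[-+*/%=<>!&|^~?:]"),   # single characters only, for brevity
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

# Keywords after which an expression (and so a regex literal) may start.
KEYWORDS_BEFORE_REGEX = {"return", "typeof", "case", "in", "of", "delete", "void", "throw", "new"}


def regex_allowed(prev):
    """Heuristic: may a regex literal start after the token `prev`?"""
    if prev is None:
        return True
    kind, text = prev
    if kind == "OP":
        return True
    if kind == "PUNCT":
        return text in "([{,;"
    if kind == "IDENT":
        return text in KEYWORDS_BEFORE_REGEX
    return False  # after a number, string, regex, ")", "]" or "}": treat "/" as division


def tokenize(src):
    tokens, pos, prev = [], 0, None
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at offset {pos}")
        kind, text = m.lastgroup, m.group()
        if kind == "OP" and text == "/" and regex_allowed(prev):
            r = REGEX_LITERAL.match(src, pos)
            if r:
                kind, text = "REGEX", r.group()
        pos += len(text)
        if kind not in ("WS", "COMMENT"):
            prev = (kind, text)
            tokens.append(prev)
    return tokens


if __name__ == "__main__":
    for token in tokenize(r'phone=phone.replace(/[\(\)-]/g, "")'):
        print(token)
```

With this in place the brackets inside the regex literal never reach the parser as scope openers, because the whole literal arrives as a single REGEX token. Perl's arbitrary delimiters (as in s@\\@//@g) would need a different and considerably messier rule.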

Now, there are other parsing techniques that sort-of squish together the tokenization with the parsing. ("PEG" parsers for example)

Edit, another note: you can't parse languages like JavaScript or Perl with just a regular expression.
