How can I de-comment JavaScript code with this preg_replace?

I'm trying to strip the // comments from my JavaScript with PHP's preg_replace(). The pattern I wrote should do the following:

1. When a comment starts on a new line, delete that entire line: // COMMENTS .....

2. When a comment follows code on the same line, after one TAB, remove only the comment part: exampleScript(); // (1 space) comments

3. Don't match the // in http://

This preg_replace does the above job. However, it also mangles three pieces of code that contain // (see "false matches" below), which it should skip.

$buffer = preg_replace('/(?<!http:)\/\/\s*[^\r\n]*/', '', $buffer);

good matches

//something

// something *!&~@#^hjksdhaf

function();// comment

false matches

(/\/\.\//)
"//"  
"://"  

So how can I exclude these three false matches, and how should I change the regex below?

(?<!http:)\/\/\s*[^\r\n]*

PS, I don't wish to use others' code minifiers/frameworks with their own overheads. Just my own for now.


Why not use a preexisting JavaScript minifier, like the YUI Compressor (PHP bindings here)?


If you are really set on writing your own, have a look through the source code to see how it's done.
Short version: The Right Way is to use a proper parser/tokenizer approach.


JavaScript's syntax is described by a context-free grammar, not a regular one, so it cannot be parsed with regular expressions alone.

In formal language theory, the pumping lemma for regular languages shows that some context-free languages, such as arbitrarily nested brackets, are not regular, so no regular expression can recognize them.

The gist of the problem is this: you can't just look for the string //, because it could be contained inside otherwise valid code, for example, a string. You can't just look for a // between two quotation marks either, because then you'd get false positives like alert('no!') // can't do it where the text ) // can is technically contained between two ' marks. Instead, you'd have to detect where strings begin and end. Worse, one kind of quote can appear inside a string delimited by the other kind, and quotes (even unbalanced ones) can appear inside comments!

There is no simple general solution -- JavaScript syntactic elements like strings, brackets, parentheses, etc., can be nested arbitrarily many levels deep. The only way to accurately detect where any syntactic element begins and ends is to correctly parse all the syntactic elements that you might encounter along the way.

The correct answer is to use an actual parser.
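To make the tokenizer idea concrete, here is a minimal sketch of a single left-to-right scan that only treats // as a comment when it is outside a string or regex literal. It is not taken from any minifier, and the function names (skip_literal, regex_may_start, strip_line_comments) are hypothetical. It assumes single- and double-quoted strings only (no template literals), copies /* */ comments through unchanged, and guesses at regex literals with a punctuation heuristic that will miss cases such as return /re/.

```php
<?php
// Sketch only: every function name here is hypothetical, not from a library.

// Returns the offset just past the literal that opens at $start with $delim
// (a quote character, or "/" for a regex literal).
function skip_literal(string $src, int $start, string $delim): int
{
    $len = strlen($src);
    $inClass = false; // inside [...] of a regex, where "/" does not terminate
    for ($i = $start + 1; $i < $len; $i++) {
        $ch = $src[$i];
        if ($ch === '\\') {
            $i++; // skip the escaped character
        } elseif ($ch === "\n") {
            return $i; // unterminated literal: give up at the line end
        } elseif ($delim === '/' && $ch === '[') {
            $inClass = true;
        } elseif ($delim === '/' && $ch === ']') {
            $inClass = false;
        } elseif ($ch === $delim && !$inClass) {
            return $i + 1;
        }
    }
    return $len;
}

// Heuristic (assumption): "/" opens a regex literal only at the start of the
// input or after one of these punctuation characters; otherwise it is division.
function regex_may_start(string $out): bool
{
    $trimmed = rtrim($out);
    if ($trimmed === '') {
        return true;
    }
    return strpos('(,=:[!&|?{};', $trimmed[strlen($trimmed) - 1]) !== false;
}

function strip_line_comments(string $src): string
{
    $out = '';
    $len = strlen($src);
    $i = 0;
    while ($i < $len) {
        $ch = $src[$i];
        $next = ($i + 1 < $len) ? $src[$i + 1] : '';
        if ($ch === '/' && $next === '/') {
            // Line comment: skip up to (not including) the line break,
            // and drop the spaces/tabs that preceded the comment.
            $i += strcspn($src, "\r\n", $i);
            $out = rtrim($out, " \t");
        } elseif ($ch === '/' && $next === '*') {
            // Block comment: copied through untouched.
            $end = strpos($src, '*/', $i + 2);
            $end = ($end === false) ? $len : $end + 2;
            $out .= substr($src, $i, $end - $i);
            $i = $end;
        } elseif ($ch === '"' || $ch === "'" || ($ch === '/' && regex_may_start($out))) {
            // String or regex literal: copied through untouched.
            $end = skip_literal($src, $i, $ch);
            $out .= substr($src, $i, $end - $i);
            $i = $end;
        } else {
            $out .= $ch;
            $i++;
        }
    }
    return $out;
}

echo strip_line_comments("alert('no!') // can't do it"), "\n"; // prints: alert('no!')
```

A comment that occupies a whole line leaves an empty line behind in this sketch; collapsing those is a separate step. The regex heuristic is the weak point, which is why a real parser remains the safer choice.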


$buffer = preg_replace('/(?<!\S)\/\/\s*[^\r\n]*/', '', $buffer);

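This matches // only at the very start of the input or directly after whitespace, so it leaves the three false matches from the question alone (there the character before // is \, " or :) and still removes comments that start a line or follow a tab or space. Two caveats: a comment glued to the code with no whitespace, as in function();// comment, is not removed, and a string that has whitespace before //, such as "a // b", would still be cut off.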
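A quick way to check a pattern like this is to run it over the samples from the question. The sketch below does that with the pattern from this answer; strip_with is a hypothetical helper, and $variant is an assumed alternative (not part of the answer) that drops the \s*, because \s* also matches line breaks and lets an empty // comment swallow the following line.

```php
<?php
// Pattern copied from the answer above.
$pattern = '/(?<!\S)\/\/\s*[^\r\n]*/';
// Hypothetical variant (an assumption, not part of the answer): without \s*
// the match can no longer run across a line break.
$variant = '/(?<!\S)\/\/[^\r\n]*/';

// Hypothetical helper: applies one pattern to one piece of JavaScript.
function strip_with(string $pattern, string $js): string
{
    return preg_replace($pattern, '', $js) ?? $js;
}

$samples = [
    '//something',            // comment at line start
    "run();\t// comment",     // comment after a tab
    'function();// comment',  // comment with no whitespace before it
    '(/\/\.\//)',             // regex literal from the question
    '"//"',
    '"://"',
    "//\nrun();",             // empty comment followed by code
];

foreach ($samples as $js) {
    echo var_export($js, true), ' => ', var_export(strip_with($pattern, $js), true), "\n";
}
```

With $pattern, the first two samples lose their comment, the next four come back unchanged, and the last one comes back empty because run(); is removed along with the comment. With $variant, the last sample keeps the line break and run();.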

Three awesome websites on the net that help with finding the correct regex:

http://gskinner.com/RegExr/

http://lumadis.be/regex/test_regex.php

http://cs.union.edu/~hannayd/csc350/simulators/RegExp/reg.htm
