Java - regular expression finding comments in code
A little fun with Java this time. I want to write a program that reads code from standard input (line by line, for example), like:
// some comment
class Main {
    /* blah */
    // /* foo
    foo();
    // foo */
    foo2();
    /* // foo2 */
}
finds all comments in it and removes them. I'm trying to use regular expressions, and so far I have something like this:
private static String ParseCode(String pCode)
{
    String MyCommentsRegex = "(?://.*)|(/\\*(?:.|[\\n\\r])*?\\*/)";
    return pCode.replaceAll(MyCommentsRegex, " ");
}
but it does not work in all cases, e.g.:
System.out.print("We can use /* comments */ inside a string of course, but it shouldn't start a comment");
Any advice, or ideas other than regex? Thanks in advance.
You may have already given up on this by now but I was intrigued by the problem.
I believe this is a partial solution...
Native regex:
//.*|("(?:\\[^"]|\\"|.)*?")|(?s)/\*.*?\*/
In Java:
String clean = original.replaceAll( "//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/", "$1 " );
This appears to properly handle comments embedded in strings as well as properly escaped quotes inside strings. I threw a few things at it to check but not exhaustively.
There is one compromise: every string literal ("..." block) in the code ends up with a space after it. Solving that while keeping the regex simple would be very difficult, given the need to cleanly handle:
int/* some comment */foo = 5;
A simple Matcher.find/appendReplacement loop could check for group(1) and only substitute a space when the match is a comment; it would be just a handful of lines of code. Still simpler than a full-blown parser, maybe. (I could add the matcher loop too if anyone is interested.)
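For anyone curious, here is a minimal sketch of what such a find/appendReplacement loop might look like. It reuses the regex from this answer unchanged; the class name CommentStripper and the method name stripComments are made up for illustration. It inherits the regex's limitations (for example, a char literal such as '"' will confuse it), so treat it as a sketch rather than a complete solution.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentStripper {
    // Regex from the answer above. Group 1 is set only when the match is a string literal.
    private static final Pattern TOKEN = Pattern.compile(
            "//.*|(\"(?:\\\\[^\"]|\\\\\"|.)*?\")|(?s)/\\*.*?\\*/");

    // Hypothetical helper: copies string literals through untouched and
    // replaces each comment with a single space.
    static String stripComments(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = (m.group(1) != null) ? m.group(1) : " ";
            // quoteReplacement keeps '$' and '\' inside a literal from being
            // interpreted as group references by appendReplacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("int/* some comment */foo = 5;"));
        System.out.println(stripComments("String s = \"/* kept */\"; // removed"));
    }
}
```

With the conditional on group(1), string literals no longer pick up the extra trailing space that the one-line replaceAll version adds.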
The last example is no problem I think:
/* we comment out some code
System.out.print("We can use */ inside a string of course");
we end the comment */
... because the comment actually ends with "We can use */. This code does not compile.
But I have another problematic case:
int/*comment*/foo=3;
Your pattern will transform this into:
intfoo=3;
...which is invalid code. So it is better to replace comments with " " instead of "".
I think a 100% correct solution using regular expressions is either inhumanly complex or impossible (taking into account escapes, etc.).
I believe the best option would be to use ANTLR; I believe they even provide a Java grammar you can use.
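If a full parser generator feels like overkill, a middle ground between regex and ANTLR is a small hand-written scanner that tracks whether it is currently in code, a string, a char literal, or a comment. This is not something proposed elsewhere in this thread, only a rough sketch of the idea; the names CommentScanner and stripComments are made up, and it assumes plain Java source without text blocks (""") or Unicode escapes such as \u002F.

```java
class CommentScanner {
    private enum State { CODE, LINE_COMMENT, BLOCK_COMMENT, STRING, CHAR }

    // Hypothetical helper: one pass over the source, one character at a time.
    static String stripComments(String src) {
        StringBuilder out = new StringBuilder();
        State state = State.CODE;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            char next = (i + 1 < src.length()) ? src.charAt(i + 1) : '\0';
            switch (state) {
                case CODE:
                    if (c == '/' && next == '/') { state = State.LINE_COMMENT; i += 2; continue; }
                    if (c == '/' && next == '*') { state = State.BLOCK_COMMENT; i += 2; continue; }
                    if (c == '"') state = State.STRING;
                    else if (c == '\'') state = State.CHAR;
                    out.append(c);
                    break;
                case LINE_COMMENT:
                    // The comment ends at the line terminator, which is kept.
                    if (c == '\n' || c == '\r') { out.append(c); state = State.CODE; }
                    break;
                case BLOCK_COMMENT:
                    // A block comment becomes one space so int/*x*/foo stays two tokens.
                    if (c == '*' && next == '/') { out.append(' '); state = State.CODE; i += 2; continue; }
                    break;
                case STRING:
                case CHAR:
                    out.append(c);
                    // Copy an escaped character verbatim so \" or \' does not end the literal.
                    if (c == '\\' && i + 1 < src.length()) { out.append(next); i += 2; continue; }
                    if ((state == State.STRING && c == '"') || (state == State.CHAR && c == '\'')) {
                        state = State.CODE;
                    }
                    break;
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("int/*comment*/foo=3;"));
        System.out.println(stripComments("print(\"We can use /* comments */ inside a string\"); // bye"));
    }
}
```

Because the state is explicit, comment markers inside string or char literals are never mistaken for comments, which is the case the pure-regex attempts struggle with.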
I ended up with this solution.
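import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentsFun {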
    static List<Match> commentMatches = new ArrayList<Match>();
    public static void main(String[] args) {
        Pattern commentsPattern = Pattern.compile("(//.*?$)|(/\\*.*?\\*/)", Pattern.MULTILINE | Pattern.DOTALL);
        Pattern stringsPattern = Pattern.compile("(\".*?(?<!\\\\)\")");
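        // getTextFromFile is a helper not shown in this post; it is assumed to return the file's contents as a String.
        String text = getTextFromFile("src/my/test/CommentsFun.java");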
        Matcher commentsMatcher = commentsPattern.matcher(text);
        while (commentsMatcher.find()) {
            Match match = new Match();
            match.start = commentsMatcher.start();
            match.text = commentsMatcher.group();
            commentMatches.add(match);
        }
        List<Match> commentsToRemove = new ArrayList<Match>();
        Matcher stringsMatcher = stringsPattern.matcher(text);
        while (stringsMatcher.find()) {
            for (Match comment : commentMatches) {
                if (comment.start > stringsMatcher.start() && comment.start < stringsMatcher.end())
                    commentsToRemove.add(comment);
            }
        }
        for (Match comment : commentsToRemove)
            commentMatches.remove(comment);
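        // Caveat: String.replace substitutes every occurrence of the comment text,
        // so identical text inside a string literal would be replaced as well.
        for (Match comment : commentMatches)
            text = text.replace(comment.text, " ");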
        System.out.println(text);
    }
    //Single-line
    // "String? Nope"
    /*
    * "This  is not String either"
    */
    //Complex */
    ///*More complex*/
    /*Single line, but */
    String moreFun = " /* comment? doubt that */";
    String evenMoreFun = " // comment? doubt that ";
    static class Match {
        int start;
        String text;
    }
}
Another alternative is to use a library that supports AST parsing, e.g. org.eclipse.jdt.core, which has all the APIs you need to do this and more. But then that's just one alternative :)
 