Removing C++ Comments From Source Code
I have some C++ code with /* */ and // style comments, and I want a way to remove them all automatically. Apparently, using an editor (e.g. UltraEdit) with a regexp search for /*, */ and // should do the job. But on a closer look, a complete solution isn't that simple, because the sequences /* or // do not start a comment when they appear inside another comment, a string literal or a character literal. For example,
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
has a comment sequence inside a string literal, and it isn't a simple task to determine whether a piece of text lies inside a pair of valid double quotes. While this
// this is a single line comment /* <--- this does not start a comment block
// this is a second comment line with an */ within
has comment sequences inside other comments.
Is there a more comprehensive way to remove comments from C++ source that takes string literals and comments into account? For example, can we instruct the preprocessor to remove comments without carrying out, say, the #include directives?
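To make the failure mode concrete, here is a minimal sketch of the naive regexp approach described above. The function name naive_strip and the exact pattern are illustrative assumptions, not something taken from any editor. It deletes everything that looks like a comment, including text inside a string literal.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: delete anything that *looks* like a comment,
// with no awareness of string or character literals.
std::string naive_strip(const std::string& src) {
    // Either /* ... */ (shortest match, may span lines) or // to end of line.
    static const std::regex comment(R"(/\*(.|\n)*?\*/|//[^\n]*)");
    return std::regex_replace(src, comment, "");
}
```

Applied to a line such as printf(" /* not a comment */"); this sketch removes the text between the quotes and so changes what the program prints, which is exactly the problem described in the question.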
The C preprocessor can remove the comments.
Edit:
I have updated this answer so that the predefined macros are available and the #if statements are evaluated correctly.
> cat t.cpp
/*
* Normal comment
*/
// this is a single line comment /* <--- this does not start a comment block
// this is a second comment line with an */ within
#include <stdio.h>
#if __SIZEOF_LONG__ == 4
int bits = 32;
#else
int bits = 16;
#endif
int main()
{
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
/*
* comment with a single // line comment embedded.
*/
int x;
// A single line comment /* Normal embedded */ Comment
}
Because we want the #if statements to expand correctly we need a list of defines. That's relatively trivial: cpp -E -dM.
Then we pipe the #defines and the original file back through the preprocessor, but prevent the includes from being expanded this time.
> cpp -E -dM t.cpp > /tmp/def
> cat /tmp/def t.cpp | sed -e s/^#inc/-#inc/ | cpp - | sed s/^-#inc/#inc/
# 1 "t.cpp"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "t.cpp"
#include <stdio.h>
int bits = 32;
int main()
{
printf(" \" \" " " /* this is not a comment and is surrounded by an unknown number of double-quotes */");
int x;
}
Our SD C++ Formatter has an option to pretty print the source text and remove all comments. It uses our full C++ front end to parse the text, so it is not confused by whitespace, line breaks, string literals or preprocessor issues, nor will it break the code by its formatting changes.
If you are removing comments, you may be trying to obfuscate the source code. The Formatter also comes in an obfuscating version.
Could someone please vote up my own answer to my own question?
Thanks to Martin York's idea, I found that in Visual Studio the solution looks very simple (subject to further testing). Just rename ALL preprocessor directives to something else (something that is not valid C++ syntax is fine) and run cl.exe with /P:
cl target.cpp /P
This produces target.i, which contains the source minus the comments. Just rename the directives back and there you go. You will probably also need to remove the #line directives generated by cl.exe.
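The rename and restore steps can be scripted instead of done by hand. The following is a minimal sketch, assuming directives are disabled by prefixing the line with '-' (the convention used in the vc8.cpp example in this answer). The helper names is_directive, mask_directives and unmask_directives are hypothetical, and the sketch only recognizes cl.exe-style #line markers.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true if the first non-blank character of the line is '#'.
static bool is_directive(const std::string& line) {
    std::size_t i = line.find_first_not_of(" \t");
    return i != std::string::npos && line[i] == '#';
}

// Prefix every directive line with '-' so the preprocessor sees plain text.
std::string mask_directives(const std::string& src) {
    std::istringstream in(src);
    std::string line, out;
    while (std::getline(in, line)) {
        if (is_directive(line)) out += '-';
        out += line;
        out += '\n';
    }
    return out;
}

// Undo mask_directives on the preprocessor output. Any line that still
// starts with "#line" was generated by the preprocessor, so it is dropped.
std::string unmask_directives(const std::string& src) {
    std::istringstream in(src);
    std::string line, out;
    while (std::getline(in, line)) {
        if (line.compare(0, 5, "#line") == 0) continue;
        if (!line.empty() && line[0] == '-' && is_directive(line.substr(1)))
            line.erase(0, 1);
        out += line;
        out += '\n';
    }
    return out;
}
```

The intended use is to run mask_directives on the source, feed the result to cl.exe /P, and then run unmask_directives on the .i file. One assumption to keep in mind: a genuine code line that begins with '-' followed by '#' would be unmasked by mistake, although such a line is very unlikely in C++.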
The renaming trick works because, according to MSDN, the phases of translation are as follows:
Character mapping: Characters in the source file are mapped to the internal source representation. Trigraph sequences are converted to single-character internal representation in this phase.
Line splicing: All lines ending in a backslash (\) and immediately followed by a newline character are joined with the next line in the source file, forming logical lines from the physical lines. Unless it is empty, a source file must end in a newline character that is not preceded by a backslash.
Tokenization: The source file is broken into preprocessing tokens and white-space characters. Comments in the source file are replaced with one space character each. Newline characters are retained.
Preprocessing: Preprocessing directives are executed and macros are expanded into the source file. The #include statement invokes translation starting with the preceding three translation steps on any included text.
Character-set mapping: All source character set members and escape sequences are converted to their equivalents in the execution character set. For Microsoft C and C++, both the source and the execution character sets are ASCII.
String concatenation: All adjacent string and wide-string literals are concatenated. For example, "String " "concatenation" becomes "String concatenation".
Translation: All tokens are analyzed syntactically and semantically; these tokens are converted into object code.
Linkage: All external references are resolved to create an executable program or a dynamic-link library.
Comments are removed during Tokenization, prior to the Preprocessing phase. So just make sure that during the Preprocessing phase nothing is available for processing (by disabling all the directives), and the output should be just what the previous three phases produced.
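The ordering of the phases matters: because line splicing runs before comments are recognized, a // comment that ends in a backslash swallows the next physical line. Here is a minimal sketch of the splicing step alone. The name splice_lines is hypothetical, and the sketch assumes plain '\n' line endings.

```cpp
#include <string>

// Hypothetical helper: join every line that ends in a backslash with the
// line that follows it, as the line splicing phase does.
std::string splice_lines(const std::string& src) {
    std::string out;
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] == '\\' && i + 1 < src.size() && src[i + 1] == '\n') {
            ++i;  // skip both the backslash and the newline
            continue;
        }
        out += src[i];
    }
    return out;
}
```

After splicing, a comment written across two physical lines is a single logical line, so a tokenizer that runs afterwards removes both parts together.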
As for the user-defined .h files, use the /FI option to include them manually. The resulting .i file will be a combination of the .cpp and the .h files, without comments. Each piece is preceded by a #line with the proper filename, so it is easy to split them up in an editor. If we don't want to split them up manually, we probably need to use the macro/scripting facility of some editor to do it automatically.
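Splitting the combined .i file can also be done with a small program rather than an editor macro. The following is a minimal sketch, assuming each piece is introduced by a marker of the form #line N "filename". The function name split_by_line_markers is hypothetical, and the filename is taken verbatim from between the quotes (any escaped backslashes in a path are left as they are).

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: group the lines of preprocessor output by the
// filename named in the most recent #line marker.
std::map<std::string, std::string> split_by_line_markers(const std::string& text) {
    std::map<std::string, std::string> files;
    std::string current;  // filename from the most recent marker
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 5, "#line") == 0) {
            std::size_t open = line.find('"');
            std::size_t close = line.rfind('"');
            if (open != std::string::npos && close > open)
                current = line.substr(open + 1, close - open - 1);
            continue;  // the marker itself is not copied
        }
        files[current] += line + '\n';
    }
    return files;
}
```

Each map entry can then be written back to its own file. Text that appears before the first marker ends up under the empty filename.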
So now we don't have to care about any of the preprocessor directives. Even better, the line continuation character (backslash) is handled.
For example:
// vc8.cpp : Defines the entry point for the console application.
//
-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
/* comment here */
whatever error line is ok
-#else
some error line if NOERR not defined
// comment here
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
pr();
return 0;
}
/*comment*/
void pr() {
printf(" /* "); /* comment inside string " */
// comment terminated by \
continue a comment line
printf(" "); /** " " string inside comment */
printf/* this is valid comment within line continuation */\
("some weird lines \
with line continuation");
}
After cl.exe vc8.cpp /P, it becomes this, which can then be fed to cl.exe again after restoring the directives (and removing the #line):
#line 1 "vc8.cpp"
-#include "stdafx.h"
-#include <windows.h>
-#define NOERR
-#ifdef NOERR
whatever error line is ok
-#else
some error line if NOERR not defined
-#endif
void pr() ;
int _tmain(int argc, _TCHAR* argv[])
{
pr();
return 0;
}
void pr() {
printf(" /* ");
printf(" ");
printf\
("some weird lines \
with line continuation");
}
You can use a rule-based parser (e.g. boost::spirit) to write syntax rules for comments. You will need to decide whether to process nested comments or not depending on your compiler. Semantic actions removing comments should be pretty straightforward.
Regexes are not meant to parse languages; it's a frustrating attempt at best.
You actually need a full-blown parser for this. You might wish to consider Clang: rewriting is an explicit goal of the Clang library suite, and there are already existing rewriters that you could draw inspiration from.
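#include <fstream>
#include <iostream>
using namespace std;

// Copies input.txt to output.txt with // and /* */ comments removed.
// Comment markers inside string and character literals are preserved.
// Not handled: trigraphs, raw string literals, digit separators (1'000).
int main() {
    ifstream fin("input.txt");
    ofstream fout("output.txt");
    if (!fin || !fout) {
        cerr << "cannot open input.txt or output.txt\n";
        return 1;
    }
    enum State { CODE, SLASH, LINE_COMMENT, BLOCK_COMMENT, BLOCK_STAR, LITERAL, ESCAPE };
    State state = CODE;
    char quote = 0;  // quote character that opened the current literal
    char prev = 0;   // previous input character, used to spot backslash-newline
    char ch;
    while (fin.get(ch)) {  // stops cleanly at end of file, unlike while(!fin.eof())
        switch (state) {
        case SLASH:  // one '/' seen; decide whether a comment starts
            if (ch == '/') { state = LINE_COMMENT; break; }
            if (ch == '*') { state = BLOCK_COMMENT; break; }
            fout << '/';  // it was an ordinary slash (e.g. division)
            state = CODE;
            // fall through: handle ch as ordinary code
        case CODE:
            if (ch == '/') { state = SLASH; break; }
            if (ch == '"' || ch == '\'') { quote = ch; state = LITERAL; }
            fout << ch;
            break;
        case LINE_COMMENT:  // ends at a newline not preceded by a backslash
            if (ch == '\n' && prev != '\\') { fout << ch; state = CODE; }
            break;
        case BLOCK_COMMENT:
            if (ch == '*') state = BLOCK_STAR;
            break;
        case BLOCK_STAR:  // '*' seen inside a block comment
            if (ch == '/') { fout << ' '; state = CODE; }  // comment becomes one space
            else if (ch != '*') state = BLOCK_COMMENT;
            break;
        case LITERAL:  // inside "..." or '...': copy everything verbatim
            fout << ch;
            if (ch == '\\') state = ESCAPE;
            else if (ch == quote) state = CODE;
            break;
        case ESCAPE:  // character after a backslash never ends the literal
            fout << ch;
            state = LITERAL;
            break;
        }
        prev = ch;
    }
    if (state == SLASH) fout << '/';  // file ended right after a lone slash
    return 0;
}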