Using regular expressions in python to determine C++ functions and their parameters
So I'm doing something wrong in this python script, but it's becoming convoluted and I'm losing sight of what I'm doing wrong.
I want a script to go through a file, find all the function definitions, and then pull out the name, return type, and parameters of the function, and output a "doxygen" style comment like this:
/******************************************************************************/
/*!
\brief
Main function for the file
\return
The exit code for the program
*/
/******************************************************************************/
But I'm doing something wrong with the regular expression in trying to parse the parameters... Here is the script so far:
import re
import sys
f = open(sys.argv[1])
functions = []
for line in f:
match = re.search(r'([\w]+)\s+([\S]+)\(([\w+\s+\w+])+\)',line)
if line.find("\\fn") < 0:
if match:
returntype = match.group(1)
funcname = match.group(2)
print '/********************************************************************'
print " \\fn " + match.group()
print ''
print ' \\brief'
print ' Function description for ' + funcname
print ''
if len(match.groups()) > 2:
params = []
count = len(match.groups()) - 2
while count > 0:
matchingstring = match.group(count + 2)
if matchingstring.find("void") < 0:
params.append(matchingstring)
count -= 1
for parameter in params:
print " \\param " + parameter
开发者_如何学JAVA print ' Description of ' + parameter
print ''
print ' \\return'
print ' ' + returntype
print '********************************************************************/'
print ''
Any help would be appreciated. Thanks
The grammar of C++ is far too complex to be handled by simple
regular expressions. You'll need at least a minimal parser.
I've found that for restricted cases, where I'm not concerned
with C++ in general, but only my own style, I can often get away
with a flex based tokenizer and a simple state machine. This
will fail in many cases of legal C++—for starters, of
course, if someone uses the pre-processor to modify the syntax;
but also because <
can have different meanings, depending on
what precedes it names a template or not. But it's often
adequate for a specific job.
I've used a PEG parser with great success when trying to do simple format parsing. pyPeg is a very simple implementation of such a parser written in Python.
Example Python code for C++ function parser:
EDIT: Address template parameters. Tested with input from SK-logic and output is correct.
import pyPEG
from pyPEG import parseLine
import re
def symbol(): return re.compile(r"[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ&*][\w:]+")
def type(): return symbol
def functionName(): return symbol
def templatedType(): return symbol, "<", -1, [templatedType, symbol, ","], ">"
def parameter(): return [templatedType, type], symbol
def template(): return "<", -1, [symbol, template], ">"
def function(): return [type, templatedType], functionName, -1, template, "(", -1, [",", parameter], ")" # -1 -> zero or more repetitions.
sourceCode = "std::string foobar(std::vector<int> &A, std::map<std::string, std::vector<std::string> > &B)"
results = parseLine(sourceCode, function(), [], packrat=True)
When this is executed results is:
([(u'type', [(u'symbol', 'std::string')]), (u'functionName', [(u'symbol', 'foobar')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'int')]), (u'symbol', '&A')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::map'), (u'symbol', 'std::string'), (u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'std::string')])]), (u'symbol', '&B')])], '')
C++ cannot really be parsed by a (sane) regular expression: they are a nightmare as soon as nesting is concerned.
There is another concern too, determining when to parse and when not to. A function may be declared:
- at file scope
- in a namespace
- in a class
And the two last can be nested at arbitrary depths.
I would propose to use CLang here. It's a real C++ front-end with a full-featured parser and there are:
- a C API, with (notably) an API to the Indexing Library
- Python bindings on top of the C API
The C API and Python bindings are far from fully exposing the underlying C++ model, but for a task as simple as listing functions it should be enough.
That said, I would question the usefulness of the project: if the documentation can be generated by a simple parser, then it is redundant with the code. And redundancy is at best, useless, and worst dangerous: it introduces the potential threat of desynchronization...
If the function is tricky enough that its use requires documentation, then a developer, who knows the limitations and al, has to write this documentation.
精彩评论