
Using regular expressions in Python to find C++ functions and their parameters

So I'm doing something wrong in this python script, but it's becoming convoluted and I'm losing sight of what I'm doing wrong.

I want a script to go through a file, find all the function definitions, and then pull out the name, return type, and parameters of the function, and output a "doxygen" style comment like this:

/******************************************************************************/
  /*!
    \brief
      Main function for the file

    \return
      The exit code for the program
  */
/******************************************************************************/

But I'm doing something wrong with the regular expression in trying to parse the parameters... Here is the script so far:

import re
import sys

f = open(sys.argv[1])

functions = []

for line in f:
  match = re.search(r'([\w]+)\s+([\S]+)\(([\w+\s+\w+])+\)',line)
  if line.find("\\fn") < 0:
    if match:
      returntype = match.group(1)
      funcname = match.group(2)
      print '/********************************************************************'
      print "  \\fn " + match.group()
      print ''
      print '  \\brief'
      print '    Function description for ' + funcname
      print ''
      if len(match.groups()) > 2:
        params = []
        count = len(match.groups()) - 2
        while count > 0:
          matchingstring = match.group(count + 2)
          if matchingstring.find("void") < 0:
            params.append(matchingstring)
          count -= 1
        for parameter in params:
          print "  \\param " + parameter
          print '    Description of ' + parameter
          print ''
      print '  \\return'
      print '    ' + returntype
      print '********************************************************************/'
      print ''

Any help would be appreciated. Thanks


The grammar of C++ is far too complex to be handled by simple regular expressions. You'll need at least a minimal parser. I've found that for restricted cases, where I'm not concerned with C++ in general but only with my own style, I can often get away with a flex-based tokenizer and a simple state machine. This will fail in many cases of legal C++: for starters, of course, if someone uses the preprocessor to modify the syntax, but also because < can have different meanings depending on whether what precedes it names a template. But it's often adequate for a specific job.
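
To give an idea of what that looks like, here is a minimal sketch of the tokenizer-plus-state idea, with Python's re module standing in for flex. The names TOKEN and parse_signature are made up for this illustration. It assumes the whole declaration sits on one line, with no default arguments, function pointers, or macros. It will also happily accept a call such as "return foo(x);" as a declaration, which is exactly the kind of restriction mentioned above.

```python
import re

# Tokens: identifiers (optionally qualified with ::) and single punctuation marks.
TOKEN = re.compile(r"[A-Za-z_][\w:]*|[<>(),&*]")

def parse_signature(line):
    """Return (return_type, name, params) for a one-line declaration, or None."""
    tokens = TOKEN.findall(line)
    if "(" not in tokens or ")" not in tokens:
        return None
    open_idx = tokens.index("(")
    head = tokens[:open_idx]
    if len(head) < 2:  # need at least a return type and a name
        return None
    name = head[-1]
    return_type = " ".join(head[:-1])

    params, current, depth = [], [], 0  # depth counts open '<' brackets
    for tok in tokens[open_idx + 1:]:
        if tok == "<":
            depth += 1
        elif tok == ">":
            depth -= 1
        if tok == ")" and depth == 0:
            break  # end of the parameter list
        if tok == "," and depth == 0:
            params.append(" ".join(current))  # comma between parameters
            current = []
        else:
            current.append(tok)  # commas inside <...> stay with the type
    if current and current != ["void"]:
        params.append(" ".join(current))
    return return_type, name, params

if __name__ == "__main__":
    print(parse_signature(
        "std::string foobar(std::vector<int> &A, "
        "std::map<std::string, std::vector<std::string> > &B)"))
```

The only state is the angle-bracket depth, which is what lets it tell a comma between parameters from a comma inside a template argument list. The regular expression in the question cannot make that distinction.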


I've used a PEG parser with great success for simple format parsing. pyPEG is a very simple implementation of such a parser, written in Python.

Example Python code for C++ function parser:

EDIT: Address template parameters. Tested with input from SK-logic and output is correct.

import pyPEG
from pyPEG import parseLine
import re

def symbol(): return re.compile(r"[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ&*][\w:]+")
def type(): return symbol
def functionName(): return symbol
def templatedType(): return symbol, "<", -1, [templatedType, symbol, ","], ">"
def parameter(): return [templatedType, type], symbol
def template(): return "<", -1, [symbol, template], ">"
def function(): return [type, templatedType], functionName, -1, template, "(", -1, [",", parameter], ")" # -1 -> zero or more repetitions.


sourceCode = "std::string foobar(std::vector<int> &A, std::map<std::string, std::vector<std::string> > &B)"
results = parseLine(sourceCode, function(), [], packrat=True)

When this is executed results is:

([(u'type', [(u'symbol', 'std::string')]), (u'functionName', [(u'symbol', 'foobar')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'int')]), (u'symbol', '&A')]), (u'parameter', [(u'templatedType', [(u'symbol', 'std::map'), (u'symbol', 'std::string'), (u'templatedType', [(u'symbol', 'std::vector'), (u'symbol', 'std::string')])]), (u'symbol', '&B')])], '')


C++ cannot really be parsed by a (sane) regular expression: regular expressions become a nightmare as soon as nesting is involved.

There is another concern too, determining when to parse and when not to. A function may be declared:

  • at file scope
  • in a namespace
  • in a class

And the last two can be nested to arbitrary depth.
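
To illustrate the scope problem, here is a rough sketch that tracks the enclosing namespace/class for each line by counting braces. The function name scope_per_line is made up for this example, and it rests on strong assumptions: no braces inside strings or comments, and no "template <class T>" headers (the T would be mistaken for a scope name). It is meant to show how much bookkeeping is needed, not as a robust solution.

```python
import re

# Either a named scope opener ("namespace ns", "class Foo", "struct Bar")
# or one of the punctuation marks that affect scope tracking.
SCOPE_TOKEN = re.compile(r"\b(?:namespace|class|struct)\s+(\w+)|[{};]")

def scope_per_line(source):
    """Return a list of (line_number, enclosing_scope) pairs."""
    stack = []      # one entry per open brace; None for unnamed blocks
    pending = None  # scope name seen but whose '{' has not appeared yet
    result = []
    for lineno, line in enumerate(source.splitlines(), 1):
        # Record the scope in effect at the start of the line.
        result.append((lineno, "::".join(s for s in stack if s)))
        for m in SCOPE_TOKEN.finditer(line):
            tok = m.group(0)
            if m.group(1):
                pending = m.group(1)
            elif tok == "{":
                stack.append(pending)  # function bodies push None
                pending = None
            elif tok == "}":
                if stack:
                    stack.pop()
            else:  # ';' ends a forward declaration such as "class Foo;"
                pending = None
    return result

if __name__ == "__main__":
    example = """namespace ns {
class Widget {
  void draw();
};
}
int main();"""
    for lineno, scope in scope_per_line(example):
        print(lineno, scope or "(file scope)")
```

Even this much only says where a declaration lives. Deciding whether a given line actually is a function declaration is a separate problem.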

I would propose using Clang here. It's a real C++ front-end with a full-featured parser, and there are:

  • a C API, with (notably) an API to the Indexing Library
  • Python bindings on top of the C API

The C API and Python bindings are far from fully exposing the underlying C++ model, but for a task as simple as listing functions it should be enough.


That said, I would question the usefulness of the project: if the documentation can be generated by a simple parser, then it is redundant with the code. And redundancy is at best useless and at worst dangerous: it introduces the risk of the documentation and the code drifting out of sync...

If the function is tricky enough that its use requires documentation, then a developer who knows its limitations and so on has to write this documentation.
