Regular Expressions to find C++ elements?
I'm looking for some predefined Regexes for elements of ANSI C++.
I would like to create a program which takes a header file (with includes, namespaces, classes etc.) as input and returns lists of the class names, methods, attributes etc. that it finds.
It's hard to google for something like that; I always end up with tutorials on how to use Regexes in C++. Perhaps I'm just googling the wrong terms? Perhaps someone has already found, used, or created such Regexes.
This type of operation is not possible with a regular expression. C++ is not a regular language and hence can't be reliably parsed with one. The safest approach is to use an actual parser to locate C++ elements.
If 100% correctness is not a goal, though, a regular expression can work, because it can be crafted to catch the majority of cases within a code base. The simplest example would be the following:
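One concrete reason is nesting: finding where a class body ends means counting braces to arbitrary depth, which a plain regular expression cannot do. The sketch below is only an illustration of that counting step, written in Python; `find_matching_brace` is a hypothetical helper name, and it deliberately ignores braces inside string literals and comments, which a real parser would have to handle.

```python
def find_matching_brace(text, open_index):
    """Return the index of the '}' closing the '{' at open_index, or -1."""
    depth = 0
    for i in range(open_index, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1          # entering a nested block
        elif ch == "}":
            depth -= 1          # leaving a block
            if depth == 0:
                return i        # this brace closes the one we started at
    return -1                   # unbalanced input


source = "class Outer { class Inner { int x; }; void f() { } };"
start = source.index("{")
end = find_matching_brace(source, start)
body = source[start + 1:end].strip()
print(body)
```

The counter is the part a regular expression has no equivalent for; everything else in a regex-based approach has to work around it.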
class\s+[A-Za-z_]\w*
However, it will incorrectly match the following as a class definition:
- Forward declarations
- Any string literal with text like "class foo"
- Template parameters
- etc ...
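The false positives listed above are easy to reproduce. This is a small Python sketch, using the pattern in the form `class\s+[A-Za-z_]\w*`; the sample lines are made up for illustration and are not taken from any real code base.

```python
import re

class_re = re.compile(r"class\s+[A-Za-z_]\w*")

# Hypothetical sample lines; only the first one is a real class definition.
samples = [
    "class Widget { };",                  # genuine definition
    "class Widget;",                      # forward declaration
    'const char* s = "class foo";',       # string literal
    "template <class T> void f(T t);",    # template parameter
    "enum class Color { Red };",          # scoped enum
]

hits = [class_re.search(line).group(0) if class_re.search(line) else None
        for line in samples]

for line, hit in zip(samples, hits):
    print(f"{line!r:40} -> {hit!r}")
```

Every line produces a match, so the pattern alone cannot tell a definition from the other four cases.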
You might find the code for ctags handy. It will parse code and break out the symbols for use in emacs and other programs. In fact, it might just do all the work you are trying to do yourself.
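If ctags does the parsing, the remaining work is reading its output. The sketch below assumes the common tab-separated tags layout (tag name, file, search address, then extension fields after `;"`); `parse_tags_line` is a hypothetical helper, and the sample lines are hand-written for illustration, not captured from a real ctags run.

```python
def parse_tags_line(line):
    """Parse one ctags-style line into a dict, or return None if it is not a tag."""
    if line.startswith("!_TAG_"):       # metadata lines at the top of the file
        return None
    head, sep, ext = line.rstrip("\n").partition(';"\t')
    parts = head.split("\t", 2)
    if len(parts) != 3:
        return None
    name, path, address = parts
    entry = {"name": name, "file": path, "address": address, "kind": None}
    for field in (ext.split("\t") if sep else []):
        key, colon, value = field.partition(":")
        if not colon:
            entry["kind"] = field       # a bare field is the kind letter
        else:
            entry[key] = value          # e.g. class:Widget, or kind:c
    return entry


# Hand-written sample in the assumed format (not real tool output).
SAMPLE = [
    '!_TAG_FILE_FORMAT\t2\t/extended format/',
    'Widget\twidget.hpp\t/^class Widget {$/;"\tc',
    'resize\twidget.hpp\t/^    void resize(int w, int h);$/;"\tp\tclass:Widget',
]

entries = [e for e in map(parse_tags_line, SAMPLE) if e is not None]
class_names = [e["name"] for e in entries if e["kind"] == "c"]
print(class_names)
```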
You may also find something useful in ctags or cscope, as already mentioned. I have also come across flist.
I'm writing a Python program to extract some essential class info from a large messy C++ source tree. I'm having pretty good luck with using regexes. Fortunately, nearly all the code follows a style that lets me get away with defining just a few regexes to detect class declarations, methods, etc. Most member variables have names like "itsSomething_" or "m_something". I kludge in hard-coded hackwork to catch anything not fitting the style.
import re

class_decl_re = re.compile(r"^class +(\w+)\s*(:|\{)")
close_decl_re = re.compile(r"^\};")
method_decl_re = re.compile(r"(\w[ a-zA-Z_0-9\*\<\>]+) +(\w+)\(")
var_decl1_re = re.compile(r"(\w[ a-zA-Z_0-9\*\<\>]+) +(its\w+);")
var_decl2_re = re.compile(r"(\w[ a-zA-Z_0-9\*\<\>]+) +(m_\w+);")
comment_pair_re = re.compile(r"/\*.*\*/")
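To see what these patterns do and do not catch, here is a short check against a few made-up lines. The regexes are copied from above; the sample lines are hypothetical and were chosen to show two limitations as well: qualified types such as `std::string` are not matched, and a statement like `return foo(x);` looks like a method declaration.

```python
import re

class_decl_re = re.compile(r"^class +(\w+)\s*(:|\{)")
method_decl_re = re.compile(r"(\w[ a-zA-Z_0-9\*\<\>]+) +(\w+)\(")
var_decl1_re = re.compile(r"(\w[ a-zA-Z_0-9\*\<\>]+) +(its\w+);")
var_decl2_re = re.compile(r"(\w[ a-zA-Z_0-9\*\<\>]+) +(m_\w+);")
comment_pair_re = re.compile(r"/\*.*\*/")


def groups_or_none(regex, line):
    """Return the captured groups if regex matches at the start of line."""
    m = regex.match(line)
    return m.groups() if m else None


# Class declarations: a definition matches, a forward declaration does not.
cls_def = groups_or_none(class_decl_re, "class Widget : public Base {")
cls_fwd = groups_or_none(class_decl_re, "class Widget;")

# Methods: simple and pointer return types work, "::" in the type does not.
meth_int = groups_or_none(method_decl_re, "int getCount() const;")
meth_ptr = groups_or_none(method_decl_re, "const Foo* getFoo();")
meth_std = groups_or_none(method_decl_re, "std::string name() const;")
meth_bad = groups_or_none(method_decl_re, "return foo(x);")  # false positive

# Member variables in the two naming styles.
var_its = groups_or_none(var_decl1_re, "int itsCount_;")
var_m = groups_or_none(var_decl2_re, "double m_ratio;")

# Stripping a single-line /* ... */ comment.
stripped = comment_pair_re.sub("", "int x; /* width */")

for name in ("cls_def", "cls_fwd", "meth_int", "meth_ptr",
             "meth_std", "meth_bad", "var_its", "var_m", "stripped"):
    print(name, "=", repr(globals()[name]))
```

Note that `comment_pair_re` is greedy, so two comments on one line would also remove the code between them.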
This is a work in progress, but I'll show this (possibly buggy) (no, almost certainly buggy) snip of code to show how the regexes are used:
# at this point, we're looking at one line from a .hpp file
# from inside a class declaration. All initial whitespace has been
# stripped. All // and /*...*/ comments have been removed.
is_static = line.startswith("static ")
if is_static:
    # drop the keyword and the whitespace after it, otherwise the
    # regexes below (which use match()) would fail on the leading space
    line = line[6:].lstrip()
is_virtual = line.startswith("virtual ")
if is_virtual:
    line = line[7:].lstrip()
# I believe "virtual static" is impossible, but if our goal
# is to detect such coding gaffes, this code only sees the
# "static virtual" order, not "virtual static".
mm = method_decl_re.match(line)
vm1 = var_decl1_re.match(line)
vm2 = var_decl2_re.match(line)
if mm:
    meth_name = mm.group(2)
    minfo = MethodInfo(meth_name, classinfo.name)  # class to hold info about a method
    minfo.rettype = mm.group(1)  # return type
    minfo.is_static = is_static
    if is_static:
        if is_virtual:
            classinfo.screwed_up = True
        classinfo.class_methods[meth_name] = minfo
    else:
        minfo.is_polymorphic = is_virtual
        classinfo.obj_methods[meth_name] = minfo
elif vm1 or vm2:
    if vm1:  # deal with vars named "itsXxxxx..."
        vm = vm1
        var_name = vm.group(2)[3:]  # remove the "its"
        if var_name.endswith("_"):
            var_name = var_name[:-1]
    else:  # deal with vars named "m_Xxxxx..."
        vm = vm2
        var_name = vm.group(2)[2:]  # remove the m_
    datatype = vm.group(1)
    vi = VarInfo(var_name, datatype)
    vi.is_static = is_static
    classinfo.vars[var_name] = vi
I hope this is easy to understand and translate to other languages, at least for a starting point for anyone crazy enough to try. Use at your own risk.
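For anyone who wants something they can run end to end, here is a condensed sketch of the same idea. It is not the original program: `scan_header` and the plain-dict result are stand-ins invented for this example, the two member-variable patterns are merged into one, and the block-comment pattern is made non-greedy. It assumes one declaration per line and the `its...`/`m_...` naming style described above.

```python
import re

class_decl_re = re.compile(r"^class +(\w+)\s*(:|\{)")
close_decl_re = re.compile(r"^\};")
method_decl_re = re.compile(r"(\w[ a-zA-Z_0-9\*\<\>]+) +(\w+)\(")
var_decl_re = re.compile(r"(\w[ a-zA-Z_0-9\*\<\>]+) +((?:its|m_)\w+);")
block_comment_re = re.compile(r"/\*.*?\*/")   # single-line comments only
line_comment_re = re.compile(r"//.*")


def scan_header(text):
    """Return {class_name: {"methods": [...], "vars": [...]}} for header text."""
    classes = {}
    current = None                      # name of the class we are inside
    for raw in text.splitlines():
        line = line_comment_re.sub("", block_comment_re.sub("", raw)).strip()
        if not line:
            continue
        if current is None:
            m = class_decl_re.match(line)
            if m:
                current = m.group(1)
                classes[current] = {"methods": [], "vars": []}
            continue
        if close_decl_re.match(line):
            current = None
            continue
        for keyword in ("static", "virtual"):
            if line.startswith(keyword + " "):
                line = line[len(keyword):].lstrip()
        mm = method_decl_re.match(line)
        vm = var_decl_re.match(line)
        if mm:
            classes[current]["methods"].append((mm.group(1), mm.group(2)))
        elif vm:
            classes[current]["vars"].append((vm.group(1), vm.group(2)))
    return classes


# Made-up header used only to exercise the sketch.
HEADER = """\
class Widget : public Base {
public:
    virtual void resize(int w, int h);  // change size
    static int count();
private:
    int itsWidth_;
    double m_ratio;  /* aspect */
};
"""

result = scan_header(HEADER)
print(result)
```

Lines such as `public:` fall through both patterns and are silently skipped, which is the kind of gap the hard-coded special cases in the original program are there to cover.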