Python AST with preserved comments
I can get an AST without comments using
import ast
module = ast.parse(open('/path/to/module.py').read())
Could you show an example of getting an AST with preserved comments (and whitespace)?
The ast module doesn't include comments. The tokenize module can give you comments, but doesn't provide other program structure.
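For example, here is a minimal sketch that lists only the comment tokens of a file using the standard tokenize module (the path is a placeholder):
import tokenize

with open("/path/to/module.py") as f:
    for tok in tokenize.generate_tokens(f.readline):
        if tok.type == tokenize.COMMENT:
            # tok.start is a (line, column) pair; tok.string includes the '#'
            print(tok.start[0], tok.string)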
An AST that keeps information about formatting, comments, etc. is called a Full Syntax Tree (FST).
redbaron is able to do this. Install it with pip install redbaron and try the following code:
import redbaron
with open("/path/to/module.py", "r") as source_code:
red = redbaron.RedBaron(source_code.read())
print(red.fst())
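The FST also keeps the comments themselves as nodes, so (assuming RedBaron's find_all query API) you can list them directly; this snippet is a sketch rather than part of the original answer:
import redbaron

with open("/path/to/module.py", "r") as source_code:
    red = redbaron.RedBaron(source_code.read())

# Comments are ordinary nodes in the FST, so they can be queried like any
# other node type.
for comment in red.find_all("comment"):
    print(comment.value)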
This question naturally arises when writing any kind of Python code beautifier, PEP 8 checker, etc. In such cases you are doing a source-to-source transformation: you expect the input to be written by a human, and you not only want the output to be human-readable, but also expect it to:
1. include all comments, exactly where they appear in the original.
2. output the exact spelling of strings, including docstrings, as in the original.
This is far from easy to do with the ast module. You could call it a hole in the API: there seems to be no simple way to extend it to do 1 and 2.
Andrei's suggestion to use both ast and tokenize together is a brilliant workaround. The idea also came to me when writing a Python-to-CoffeeScript converter, but the code is far from trivial.
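The core of the idea, stripped of the py2cs.py machinery, is to index the comment tokens by line number and consult that index while visiting ast nodes. The following sketch is illustrative only and is not the TokenSync code:
import ast
import io
import tokenize

source = (
    "# a leading comment\n"
    "x = 1  # set x\n"
    "y = 2\n"
)

# Map line number -> comment text, using the tokenize module.
comments = {
    tok.start[0]: tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type == tokenize.COMMENT
}

# Walk the ast and report any comment that shares a statement's line.
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.stmt) and node.lineno in comments:
        print(node.lineno, type(node).__name__, comments[node.lineno])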
The TokenSync (ts) class starting at line 1305 in py2cs.py coordinates communication between the token-based data and the ast traversal. Given the source string s, the TokenSync class tokenizes s and inits internal data structures that support several interface methods:
- ts.leading_lines(node): Return a list of the preceding comment and blank lines.
- ts.trailing_comment(node): Return a string containing the trailing comment for the node, if any.
- ts.sync_string(node): Return the spelling of the string at the given node.
It is straightforward, but just a bit clumsy, for the ast visitors to use these methods. Here are some examples from the CoffeeScriptTraverser (cst) class in py2cs.py:
def do_Str(self, node):
    '''A string constant, including docstrings.'''
    if hasattr(node, 'lineno'):
        return self.sync_string(node)
This works provided that ast.Str nodes are visited in the order they appear in the sources. This happens naturally in most traversals.
Here is the ast.If visitor. It shows how to use ts.leading_lines and ts.trailing_comment:
def do_If(self, node):
    result = self.leading_lines(node)
    tail = self.trailing_comment(node)
    s = 'if %s:%s' % (self.visit(node.test), tail)
    result.append(self.indent(s))
    for z in node.body:
        self.level += 1
        result.append(self.visit(z))
        self.level -= 1
    if node.orelse:
        tail = self.tail_after_body(node.body, node.orelse, result)
        result.append(self.indent('else:' + tail))
        for z in node.orelse:
            self.level += 1
            result.append(self.visit(z))
            self.level -= 1
    return ''.join(result)
The ts.tail_after_body method compensates for the fact that there are no ast nodes representing 'else' clauses. It's not rocket science, but it isn't pretty:
def tail_after_body(self, body, aList, result):
    '''
    Return the tail of the 'else' or 'finally' statement following the given body.
    aList is the node.orelse or node.finalbody list.
    '''
    node = self.last_node(body)
    if node:
        max_n = node.lineno
        leading = self.leading_lines(aList[0])
        if leading:
            result.extend(leading)
            max_n += len(leading)
        tail = self.trailing_comment_at_lineno(max_n + 1)
    else:
        tail = '\n'
    return tail
Note that cst.tail_after_body just calls ts.tail_after_body.
Summary
The TokenSync class encapsulates most of the complexities involved in making token-oriented data available to ast traversal code. Using the TokenSync class is straightforward, but the ast visitors for all Python statements (and ast.Str) must include calls to ts.leading_lines, ts.trailing_comment and ts.sync_string. Furthermore, the ts.tail_after_body hack is needed to handle "missing" ast nodes.
In short, the code works well, but is just a bit clumsy.
@Andrei: your short answer might suggest that you know of a more elegant way. If so, I would love to see it.
Edward K. Ream
A few people have already mentioned lib2to3, but I wanted to create a more complete answer, because this tool is an under-appreciated gem. Don't bother with redbaron.
lib2to3 consists of a few parts:
- the parser: tokens, grammar, etc
- fixers: library of transformations
- refactor tools: applies fixers to a parsed ast
- the command line: choose fixes to apply and run them in parallel using multiprocessing
Below is a brief introduction to using lib2to3 for transformations and scraping data (i.e. extraction).
Transformations
If you'd like to transform Python files (i.e. complex find/replace), the CLI provided by lib2to3 is fully featured and can transform files in parallel.
To use it, create a Python package where each sub-module within it contains a single subclass of lib2to3.fixer_base.BaseFix. See lib2to3.fixes for lots of examples.
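As a hypothetical illustration (the printf-to-print rename and the myfixes package name are made up, not something shipped with lib2to3), a single fixer module could look like this:
# file: myfixes/fix_printf.py  (hypothetical example fixer)
from lib2to3 import fixer_base
from lib2to3.fixer_util import Name

class FixPrintf(fixer_base.BaseFix):
    BM_compatible = True
    # Match a call whose callee is the bare name 'printf'.
    PATTERN = "power< 'printf' trailer< '(' [any] ')' > >"

    def transform(self, node, results):
        # Replace the 'printf' leaf with 'print', preserving its prefix
        # (the whitespace and comments that precede it).
        name_leaf = node.children[0]
        name_leaf.replace(Name("print", prefix=name_leaf.prefix))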
Then create your executable script (replacing "myfixes" with the name of your package):
import sys
import lib2to3.main

def main(args=None):
    sys.exit(lib2to3.main.main("myfixes", args=args))

if __name__ == '__main__':
    main()
Run yourscript -h to see the options.
Scraping
If your goal is to gather data, but not transform it, then you need to do a little more work. Here's a recipe I whipped up to use lib2to3 for data scraping:
# file: basescraper.py
from __future__ import absolute_import, print_function

from lib2to3.pgen2 import token
from lib2to3.pgen2.parse import ParseError
from lib2to3.pygram import python_grammar
from lib2to3.refactor import RefactoringTool
from lib2to3 import fixer_base


def symbol_name(number):
    """
    Get a human-friendly name from a token or symbol.

    Very handy for debugging.
    """
    try:
        return token.tok_name[number]
    except KeyError:
        return python_grammar.number2symbol[number]


class SimpleRefactoringTool(RefactoringTool):

    def __init__(self, scraper_classes, options=None, explicit=None):
        self.fixers = None
        self.scraper_classes = scraper_classes
        # first argument is a list of fixer paths, as strings. we override
        # get_fixers, so we don't need it.
        super(SimpleRefactoringTool, self).__init__(None, options, explicit)

    def get_fixers(self):
        """
        Override base method to get fixers from passed fixer classes instead
        of via dotted-module-paths.
        """
        self.fixers = [cls(self.options, self.fixer_log)
                       for cls in self.scraper_classes]
        return (self.fixers, [])

    def get_results(self):
        """
        Get the scraped results returned from `scraper_classes`.
        """
        return {type(fixer): fixer.results for fixer in self.fixers}


class BaseScraper(fixer_base.BaseFix):
    """
    Base class for a fixer that stores results.

    lib2to3 was designed with transformation in mind, but if you just want
    to scrape results, you need a way to pass data back to the caller.
    """
    BM_compatible = True

    def __init__(self, options, log):
        self.results = []
        super(BaseScraper, self).__init__(options, log)

    def scrape(self, node, match):
        raise NotImplementedError

    def transform(self, node, match):
        result = self.scrape(node, match)
        if result is not None:
            self.results.append(result)


def scrape(code, scraper):
    """
    Simple interface when you have a single scraper class.
    """
    tool = SimpleRefactoringTool([scraper])
    tool.refactor_string(code, '<test.py>')
    return tool.get_results()[scraper]
And here's a simple scraper that finds the first comment after a function def:
# file: commentscraper.py
from lib2to3.pgen2 import token

from basescraper import scrape, BaseScraper, ParseError


class FindComments(BaseScraper):

    PATTERN = """
    funcdef< 'def' name=any parameters< '(' [any] ')' >
             ['->' any] ':' suite=any+ >
    """

    def scrape(self, node, results):
        suite = results["suite"]
        name = results["name"]
        if suite[0].children[1].type == token.INDENT:
            indent_node = suite[0].children[1]
            return (str(name), indent_node.prefix.strip())
        else:
            # e.g. "def foo(...): x = 5; y = 7"
            # nothing to save
            return
# example usage:

code = '''\
@decorator
def foobar():
    # type: comment goes here
    """
    docstring
    """
    pass
'''

comments = scrape(code, FindComments)
assert comments == [('foobar', '# type: comment goes here')]
If you're using Python 3, you can use bowler, which is based on lib2to3 but provides a much nicer API and CLI for creating transformation scripts: https://pybowler.io/
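As a taste (adapted from the examples on pybowler.io; treat the exact fluent-method names as an assumption of this sketch), a rename script looks roughly like this:
from bowler import Query

(
    Query("path/to/your/code")
    .select_function("old_name")
    .rename("new_name")
    .diff(interactive=True)  # preview the changes as a diff
)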
LibCST provides a Concrete Syntax Tree for Python that looks and feels like an AST. Most node types are the same as in the AST, while formatting information (comments, whitespace, commas, etc.) is also available. https://github.com/Instagram/LibCST/
In [1]: import libcst as cst
In [2]: cst.parse_statement("fn(1, 2) # a comment")
Out[2]:
SimpleStatementLine(
    body=[
        Expr(
            value=Call(
                func=Name(
                    value='fn',
                    lpar=[],
                    rpar=[],
                ),
                args=[
                    Arg(
                        value=Integer(
                            value='1',
                            lpar=[],
                            rpar=[],
                        ),
                        keyword=None,
                        equal=MaybeSentinel.DEFAULT,
                        comma=Comma(  # <--- a comma
                            whitespace_before=SimpleWhitespace(
                                value='',
                            ),
                            whitespace_after=SimpleWhitespace(
                                value=' ',  # <--- a white space
                            ),
                        ),
                        star='',
                        whitespace_after_star=SimpleWhitespace(
                            value='',
                        ),
                        whitespace_after_arg=SimpleWhitespace(
                            value='',
                        ),
                    ),
                    Arg(
                        value=Integer(
                            value='2',
                            lpar=[],
                            rpar=[],
                        ),
                        keyword=None,
                        equal=MaybeSentinel.DEFAULT,
                        comma=MaybeSentinel.DEFAULT,
                        star='',
                        whitespace_after_star=SimpleWhitespace(
                            value='',
                        ),
                        whitespace_after_arg=SimpleWhitespace(
                            value='',
                        ),
                    ),
                ],
                lpar=[],
                rpar=[],
                whitespace_after_func=SimpleWhitespace(
                    value='',
                ),
                whitespace_before_args=SimpleWhitespace(
                    value='',
                ),
            ),
            semicolon=MaybeSentinel.DEFAULT,
        ),
    ],
    leading_lines=[],
    trailing_whitespace=TrailingWhitespace(
        whitespace=SimpleWhitespace(
            value=' ',
        ),
        comment=Comment(
            value='# a comment',  # <--- comment
        ),
        newline=Newline(
            value=None,
        ),
    ),
)
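Given that structure, pulling the trailing comment back out of the parsed statement is a one-liner; this snippet simply reads the fields shown in the dump above:
import libcst as cst

stmt = cst.parse_statement("fn(1, 2) # a comment")
# The trailing comment, if any, hangs off the statement line itself.
print(stmt.trailing_whitespace.comment.value)  # -> "# a comment"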
You can use ast-comments specifically for this case (https://pypi.org/project/ast-comments/). The library uses ast and tokenize from the standard library, as mentioned in https://stackoverflow.com/a/7457047/18794028:
>>> import ast_comments as astcom
>>> source = """
... # some comments (1)
... some_variable = 'value' # inline comments (2)
... """
>>> tree = astcom.parse(source)
>>> node = tree.body[0]
>>> node.comments
('some comments (1)', 'inline comments (2)')
>>> astcom.dump(tree)
"Module(body=[Assign(targets=[Name(id='some_variable', ctx=Store())], value=Constant(value='value', kind=None), type_comment=None, comments=('some comments (1)', 'inline comments (2)'))], type_ignores=[])"
Other experts seem to think the Python AST module strips comments, so that means that route simply won't work for you.
Our DMS Software Reengineering Toolkit with its Python front end will parse Python and build ASTs that capture all the comments (see this SO example). The Python front end includes a prettyprinter that can regenerate Python code (with the comments!) directly from the AST. DMS itself provides the low-level parsing machinery, and a source-to-source transformation capability that operates on patterns written using the target language's (e.g., Python's) surface syntax.