Parsing "off-side" (indentation-based) languages
An off-side language is the one where
...the scope of declarations (a block) in that language is expressed by their indentation.
Examples of such languages are Python, Boo, Nemerle, YAML and several more.
So my question is this: how do I actually parse these? How do I resolve tabs vs spaces problem (are two tabs or 8 spaces equivalent or not)? Are parser generators of any help here开发者_Go百科 or do I have to hand-code lexer/parser myself?
Python has a lexer that generates Indent and Dedent tokens, that are equivalent to curly braces ("{", "}"). There is even an example on Stack Overflow, with a simple implementation of such a lexer.
For tabs vs. spaces, Python only has a coding convention: Use 4 spaces per indentation level. Tabs are legal syntax though.
The easiest way to resolve the tabs versus spaces problem is to disallow combinations of spaces and tabs (this is what's done in F#, for instance). Any modern editor will allow tabs to be converted to some number of spaces.
As for whether you need to abandon parser generators, probably not, but you will have to hack the offsides identification in there somewhere. This may require a bit of creativity on your part. Based on browsing the F# source, it looks like they use a post-lexing step to create additional tokens representing offside language elements.
How do I resolve tabs vs spaces problem (are two tabs or 8 spaces equivalent or not)?
It depends on how the editor's settings if two tabs will equal eight spaces.
The off-side rule, as expressed by the originator, mentions relative positioning of two successive lines of code and not the absolute number of whitespaces. Here's a nice read to help you better understand (and some quote):
"Whitespace is significant in Python source code."
No, not in general. Only the indentation level of your statements is significant (i.e. the whitespace at the very left of your statements). Everywhere else, whitespace is not significant and can be used as you like, just like in any other language. You can also insert empty lines that contain nothing (or only arbitrary whitespace) anywhere.
Also, the exact amount of indentation doesn't matter at all, but only the relative indentation of nested blocks (relative to each other). [...]
For what it's worth, Haskell is also indentation-based and optionally { foo; bar; etc } for when whitespace is inconvenient. I wrote a simple indentation-based parser with Parsec, it's indended to read like Lisp but indentation indicates operator application. Parentheses can be only used on one line.
(aaa bb) cc
e fffff (ggg hhh) iii
jjj kkk
ddd
Here aaa
is applied to bb
. What results is a ternary function. It is applied to arguments cc
, e
applied to one argument, and ddd
. See how the application is based on column alignment, and not X spaces.
The parser could probably be a lot simpler, too.
You've got several options w.r.t. tabs and spaces: either disallow mixing tabs and spaces, assume a fixed ratio of tabs to spaces, or permit the programmer to decide on a per project or per source file basis (some sort of "#pragma tab(4)" style directive to permit tabs and/or change the number of spaces they represent).
Parser generator such as ANTLR 3 can easily cope with this; I've been toying with an example myself, compiling to its C# target. The link in DirkGently's answer explains the Python algorithm, which translates directly into code. My approach was simply to define separate tokens for whitespace and newlines, and override the "emit token" function used by the lexer to insert additional indent/dedent tokens on the fly. This turned out to be simpler to implement than other approaches I've seen around which override the "get last token" function, but either works pretty well.
I got one solution by myself, which I analyse the code just like blocks for the nesting tree. For the part of brackets, I just used normal method.
Here's that parser: https://github.com/jiyinyiyong/cirru-parser
精彩评论