Parsing "off-side" (indentation-based) languages

2022-12-18 13:15 问答作者：

An off-side language is the one where

...the scope of declarations (a block) in that language is expressed by their indentation.

Examples of such languages are Python, Boo, Nemerle, YAML and several more.

So my question is this: how do I actually parse these? How do I resolve tabs vs spaces problem (are two tabs or 8 spaces equivalent or not)? Are parser generators of any help here开发者_Go百科 or do I have to hand-code lexer/parser myself?

Python has a lexer that generates Indent and Dedent tokens, that are equivalent to curly braces ("{", "}"). There is even an example on Stack Overflow, with a simple implementation of such a lexer.

For tabs vs. spaces, Python only has a coding convention: Use 4 spaces per indentation level. Tabs are legal syntax though.

The easiest way to resolve the tabs versus spaces problem is to disallow combinations of spaces and tabs (this is what's done in F#, for instance). Any modern editor will allow tabs to be converted to some number of spaces.

As for whether you need to abandon parser generators, probably not, but you will have to hack the offsides identification in there somewhere. This may require a bit of creativity on your part. Based on browsing the F# source, it looks like they use a post-lexing step to create additional tokens representing offside language elements.

How do I resolve tabs vs spaces problem (are two tabs or 8 spaces equivalent or not)?

It depends on how the editor's settings if two tabs will equal eight spaces.

The off-side rule, as expressed by the originator, mentions relative positioning of two successive lines of code and not the absolute number of whitespaces. Here's a nice read to help you better understand (and some quote):

"Whitespace is significant in Python source code."

No, not in general. Only the indentation level of your statements is significant (i.e. the whitespace at the very left of your statements). Everywhere else, whitespace is not significant and can be used as you like, just like in any other language. You can also insert empty lines that contain nothing (or only arbitrary whitespace) anywhere.

Also, the exact amount of indentation doesn't matter at all, but only the relative indentation of nested blocks (relative to each other). [...]

For what it's worth, Haskell is also indentation-based and optionally { foo; bar; etc } for when whitespace is inconvenient. I wrote a simple indentation-based parser with Parsec, it's indended to read like Lisp but indentation indicates operator application. Parentheses can be only used on one line.

(aaa bb) cc
         e fffff (ggg hhh) iii
                 jjj kkk
         ddd

Here aaa is applied to bb. What results is a ternary function. It is applied to arguments cc, e applied to one argument, and ddd. See how the application is based on column alignment, and not X spaces.

The parser could probably be a lot simpler, too.

You've got several options w.r.t. tabs and spaces: either disallow mixing tabs and spaces, assume a fixed ratio of tabs to spaces, or permit the programmer to decide on a per project or per source file basis (some sort of "#pragma tab(4)" style directive to permit tabs and/or change the number of spaces they represent).

Parser generator such as ANTLR 3 can easily cope with this; I've been toying with an example myself, compiling to its C# target. The link in DirkGently's answer explains the Python algorithm, which translates directly into code. My approach was simply to define separate tokens for whitespace and newlines, and override the "emit token" function used by the lexer to insert additional indent/dedent tokens on the fly. This turned out to be simpler to implement than other approaches I've seen around which override the "get last token" function, but either works pretty well.

I got one solution by myself, which I analyse the code just like blocks for the nesting tree. For the part of brackets, I just used normal method.

Here's that parser: https://github.com/jiyinyiyong/cirru-parser

继续阅读：compiler-theory indentation language-design

Parsing "off-side" (indentation-based) languages

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？