开发者

What parser-generators with code separation and language extensibility would you recommend?

I'm looking for 开发者_StackOverflow社区a context-free grammar parser generator with grammar/code separation and a possibility to add support for new target languages. For instance if I want parser in Pascal, I can write my own pascal code generator without reimplementing the whole thing.

I understand that most open-source parser generators can in theory be extended, still I'd prefer something that has extendability planned and documented.

Feature-wise I need the parser to at least support Python-style indentation, maybe with some additional work. No requirement on the type of parser generated, but I'd prefer something fast.

Which are the most well-known/maintained options?

Popular parser-generators seem to mostly use mixed grammar/code approach which I really don't like. Comparison list on Wikipedia lists a few but I'm a novice at this and can't tell which to try.

Why I don't like mixing grammar/code: because this approach seems like a mess. Grammar is grammar, implementation details are implementation details. They're different things written in different languages, it's intuitive to keep them in separate places.

What if I want to reuse parts of grammar in another project, with different implementation details? What if I want to compile a parser in a different language? All of this requires grammar to be kept separate.


Most parser generators won't handle context-free grammars. They handle some subset (LL(1), LL(k), LL(*), LALR(1), LR(k), ...). If you choose one of these, you will almost certainly have to hack your grammar to match the limitations of the parser generator (no left recursion, limited lookahead, ...). If you want a real context free parser generator you want an Early parser generator (inefficient), a GLR parser generator (the most practical of the lot), or a PEG parser generator (and the last isn't context-free; it requires rules to be ordered to determine which ones take precedence).

You seem to be worried about mixing syntax and parser-actions used to build the trees.

If the tree you build isn't a direct function of the syntax, there has to be some way to tie the tree-building machinery to the grammar productions. Placing it "near" the grammar production is one way, but leads to your "mixed" notation objection.

Another way is to give each rule a name (or some unique identifier), and set the tree-building machinery off to the side indexed by the names. This way your grammar isn't contaminated with the "other stuff", which seems to be your objection. None of the parser generator systems I know of do this. An awkward issue is that you now have to invent lots of rule names, and anytime you have a few hundred names that's inconvenient by itself and it is hard to make them mnemonic.

A third way is to make the a function of the syntax, and auto-generate the tree building steps. This requires no extra stuff off to the side at all to produce the ASTs. The only tool I know that does it (there may be others but I've been looking for 20 odd years and haven't seen one) is my company's product,, the DMS Software Reengineering Toolkit. [DMS isn't just a parser generator; it is a complete ecosystem for building program analysis and transformation tools for arbitrary languages, using a GLR parsing engine; yes it handles Python style indents].

One objection is that such trees are concrete, bloated and confusing; if done right, that's not true. My SO answer to this question: What is the difference between an Abstract Syntax Tree and a Concrete Syntax Tree? discusses how we get the benefits of ASTs from automatically generated compressed CSTs.

The good news about DMS's scheme is that the basic grammar isn't bloated with parsing support. The not so good news is that you will find lots of other things you want to associate with grammar rules (prettyprinting rules, attribute computations, tree synthesis,...) and you come right back around to the same choices. DMS has all of these "other things" and solves the association problem a number of ways:

  • By placing other related descriptive formalisms next to the grammar rule (producing the mixing you complained about). We tolerate this for pretty-printing rules because in fact it is nice to have the grammar (parse) rule adjacent to the pretty-print (anti-parse) rule. We also allow attribute computations to be placed near the grammar rules to provide an association.

  • While DMS allows rules to have names, this is only for convenient access by procedural code, not associating other mechanisms with the rule.

  • DMS provides a third way to associate these mechanisms (esp. attribute grammar computations) by using the rule itself as a kind of giant name. So, you write the grammar and prettyprint rules in one place, and somewhere else you can write the grammar rule again with an associated attribute computation. In principle, this is just like giving each rule a name (well, a signature) and associating the computation with the name. But it also allows us to define many, many different attribute computations (for different purposes) and associate them with their rules, without cluttering up the base grammar. Our tools check that a (rule,associated-computation) has a valid rule in the base grammar, so it makes it relatively each to track down what needs fixing when the base grammar changes.

This being my tool (I'm the architect) you shouldn't take this as a recommendation, just a bias. That bias is supported by DMS's ability to parse (without whimpering) C, C++, Java, C#, IBM Enterprise COBOL, Python, F77/F90/F95 with column6 continues/F90 continues and embedded C preprocessor directives to boot under most circumstances), Mumps, PHP4/5 and many other languages.


First off, any decent parser generator is going to be robust enough to support Python's indenting. That isn't really all that weird as languages go. You should try parsing column-sensitive languages like Fortran77 some time...

Secondly, I don't think you really need the parser itself to be "extensible" do you? You just want to be able to use it to lex and parse the language or two you have in mind, right? Again, any decent parser-generator can do that.

Thirdly, you don't really say what about the mix between grammar and code you don't like. Would you rather it be all implemented in a meta-language (kinda tough), or all in code?

Assuming it is the latter, there are a couple of in-language parser generator toolkits I know of. The first is Boost's Spirit, which is implemented in C++. I've used it, and it works. However, back when I used it you pretty much needed a graduate degree in "boostology" to be able to understand its error messages well enough to get anything working in a reasonable amount of time.

The other I know about is OpenToken, which is a parser-generation toolkit implemented in Ada. Ada doesn't have the error-novel problem that C++ has with its templates, so OpenToken is far easier to use. However, you have to use it in Ada...

Typical functional languages allow you to implement any sublanguage you like (mostly) within the language itself, thanks to their inhernetly good support for things like lambdas and metaprogramming. However, their parsers tend to be slower. That's really no problem at all if you are just parsing a configuration file or two. Its a tremendous problem if you are parsing hundreds of files at a go.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜