Are there any free parser generators that generate C++ code and handle Unicode correctly?
After asking this question, I'm now sold on trying to use a parser generator, where before I was going to write things manually.
However, I can't seem to find any such parser that generates C++ code, nor can I find a parser that correctly handles Unicode. (note that my input is in UCS-2 -- I don't care about supporting bits outside of the Basic Multilingual Plane if that makes building the parser more difficult)
There are some parsers which can generate C, but such parsers all seem to throw exception safety out the window, which would prevent me from using C++ inside any semantic actions.
Does a parser generator exist which meets these two tenets, or am I stuck doing everything by hand?
EDIT: Oh, and my project is BSL licensed, so there can't be many restrictions on use of the output of the parser generator itself.
There are two ways to do this in C++: use a program that generates C++ files from a grammar written in its own free-form syntax, or write the grammar directly in C++ using templates.
If you write the grammar as template types, you again have two choices. One is boost::proto, where every operator is redefined to build a syntax tree in boost::fusion (used in boost::spirit, boost::msm, boost::xpressive); the basic idea is described here: Expression Templates. The other is to build the expression tree by hand with your own templates and store it directly in boost::mpl containers. This technique is used in Biscuit.
In Biscuit you have the or_<>, seq_<>, char_<>, ... templates. Biscuit is based on Yard, but adds an extended boost::range to get better submatch capability.
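To give a feel for the "grammar as template types" style, here is a minimal sketch. The combinators below are hypothetical and only borrow the names or_<>, seq_<> and char_<>; star_<>, Grammar, matches and demo are also made up for illustration. This is not Biscuit's or Yard's actual implementation or API. It assumes C++17 (for fold expressions) and UCS-2 input held in a std::u16string.

```cpp
#include <cassert>
#include <string>

// Hypothetical combinators; NOT the real Biscuit/Yard API.
using Iter = std::u16string::const_iterator;

// Matches exactly one UCS-2 code unit C.
template <char16_t C>
struct char_ {
    static bool parse(Iter& it, Iter end) {
        if (it != end && *it == C) { ++it; return true; }
        return false;
    }
};

// Matches every sub-parser in order; restores the position on failure.
template <typename... Ps>
struct seq_ {
    static bool parse(Iter& it, Iter end) {
        Iter save = it;
        if ((Ps::parse(it, end) && ...)) return true;
        it = save;
        return false;
    }
};

// Tries each alternative in order; the first one that matches wins.
template <typename... Ps>
struct or_ {
    static bool parse(Iter& it, Iter end) {
        return (Ps::parse(it, end) || ...);
    }
};

// Matches P zero or more times. P must consume input when it succeeds,
// otherwise this loop would never terminate.
template <typename P>
struct star_ {
    static bool parse(Iter& it, Iter end) {
        while (P::parse(it, end)) {}
        return true;
    }
};

// Grammar: ("ab" | "c")* followed by a Greek small lambda (U+03BB).
using Grammar = seq_<
    star_<or_<seq_<char_<u'a'>, char_<u'b'>>, char_<u'c'>>>,
    char_<u'\u03BB'>>;

// True if grammar G matches the whole string.
template <typename G>
bool matches(const std::u16string& s) {
    Iter it = s.begin();
    return G::parse(it, s.end()) && it == s.end();
}

// Small usage example.
inline void demo() {
    assert(matches<Grammar>(u"abcab\u03BB"));
    assert(!matches<Grammar>(u"a\u03BB"));
}
```

The whole grammar is a type, so there is no separate generator step, and since everything is ordinary C++, semantic actions can throw exceptions freely. The cost is compile time and error messages that grow with the grammar.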
The Biscuit Parser Library 1
The Biscuit Parser Library 2
Yet Another Recursive Descent (YARD) parsing framework for C++
All right, this might be a long shot, but there is an LALR parser generator called QLALR, a side project of Qt. It is a really thin layer: the lexing is still up to you, but all the work can be done through QStrings, which support Unicode. There is not a lot of functionality to it. You write the grammar along with the code that does the work for each token, and it generates the parser for you. I have used it successfully to generate a parser with ~100 rules that builds an AST of the parsed language.
ANTLR has Unicode support. It has C++ (and C, Java and a few other languages) support, though I've never used the C++ support so I'm not sure how well developed it is.
There appears to be preliminary support for Unicode in boost::spirit.
If you're in the mood to experiment, this one supports wide chars but is obscure: http://wiki.winprog.org/wiki/LibCC_Parse
The parser doesn't care about characters since it processes tokens.
Lexing Unicode is very expensive. This is because you either pay a huge function-call overhead for classification, or you kill your memory with massive tables. Normally you'd only support Unicode in specific places in a programming language, such as string literals and perhaps identifiers, where a handcrafted function can do the job efficiently.
I once coded in Ocamllex a lexer that would accept the identifiers mandated by the ISO C++ standard (which includes a set of ranges of Unicode code points considered as "letters" in various languages). Although the number of code point ranges is quite small (around 20 or so ranges), the UTF-8 DFA for this has over 64K states and blew up the lexer generator :)
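One middle ground between a function call per property and a massive per-code-point table is a small sorted table of ranges searched with binary search. The sketch below assumes UCS-2 input (one 16-bit unit per character). The ranges are an illustrative subset chosen for this example, not the ISO C++ identifier table, and the names CodeRange, kLetterRanges, is_letter and demo_is_letter are made up here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Inclusive range of UCS-2 code units.
struct CodeRange {
    std::uint16_t first;
    std::uint16_t last;
};

// Illustrative subset only, NOT the ISO C++ identifier table.
// Must stay sorted by `first` and non-overlapping.
constexpr CodeRange kLetterRanges[] = {
    {0x0041, 0x005A},  // A-Z
    {0x0061, 0x007A},  // a-z
    {0x00C0, 0x00D6},  // Latin-1 letters up to U+00D6
    {0x00D8, 0x00F6},  // skips U+00D7 MULTIPLICATION SIGN
    {0x00F8, 0x00FF},  // skips U+00F7 DIVISION SIGN
    {0x0391, 0x03A1},  // Greek capitals (U+03A2 is unassigned)
    {0x03A3, 0x03A9},
    {0x03B1, 0x03C9},  // Greek small letters
    {0x0410, 0x044F},  // basic Cyrillic
};

// Returns true if `cu` falls inside one of the ranges above.
inline bool is_letter(std::uint16_t cu) {
    // Find the first range that starts after cu; the only range that
    // can contain cu is the one just before it.
    auto it = std::upper_bound(
        std::begin(kLetterRanges), std::end(kLetterRanges), cu,
        [](std::uint16_t value, const CodeRange& r) { return value < r.first; });
    if (it == std::begin(kLetterRanges)) return false;
    return cu <= std::prev(it)->last;
}

// Small usage example.
inline void demo_is_letter() {
    assert(is_letter(0x03BB));   // GREEK SMALL LETTER LAMDA
    assert(!is_letter(0x00D7));  // MULTIPLICATION SIGN
}
```

A lookup costs a handful of comparisons and the table stays tiny, which is the kind of handcrafted classifier that works for identifiers. It does not help a DFA-based lexer generator that needs the ranges expanded into states.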
My advice here is: you will have to hand craft your lexer. It is, in fact, very easy to do this inefficiently. Doing it efficiently is very much harder: I'd be looking at Judy arrays for support (this is the fastest data structure on the planet).
Try Boost.Spirit. You can plug in your own "stream decoder", which handles the Unicode part of your problem. Making Spirit work with wchar_t should be possible, although I have not tried it myself.
I don't know a whole lot of theory about parsers, so forgive me if this doesn't fit the bill, but there is Ragel.
Ragel generates state machines. It's (perhaps most famously?) used by the Mongrel HTTP server for Ruby to parse HTTP requests.
Ragel targets plain C (amongst others), but all of the state machine data is either static const or stack allocated, so that should alleviate some important concerns with C++ exceptions. If special exception handling is required, Ragel doesn't shy away from exposing its internals. (Not as complex as it may sound.)
Unicode should be possible, because the input is an array of any basic type, usually char, but probably short or int in your case. If that doesn't do, you can even replace the array iteration with your own mechanism for getting the next input item/token/event.
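To illustrate why static const tables plus stack-only state sit comfortably next to C++ exceptions, here is a hand-written table-driven state machine over an array of any basic type. This is not Ragel output and does not use Ragel; it is only a sketch of the general shape. The variable names p, pe and cs mirror the names Ragel conventionally uses for the input pointer, the input end and the current state, and everything else (namespace sketch, is_integer, classify, demo) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

enum State : unsigned char { Start, Signed, Digits, Error, StateCount };
enum Class : unsigned char { Sign, Digit, Other, ClassCount };

// Transition table: static const data, nothing allocated at run time.
static const State kNext[StateCount][ClassCount] = {
    /* Start  */ {Signed, Digits, Error},
    /* Signed */ {Error,  Digits, Error},
    /* Digits */ {Error,  Digits, Error},
    /* Error  */ {Error,  Error,  Error},
};

// Maps one input unit (char, char16_t, int, ...) to an input class.
template <typename Unit>
Class classify(Unit u) {
    if (u == Unit('+') || u == Unit('-')) return Sign;
    if (u >= Unit('0') && u <= Unit('9')) return Digit;
    return Other;
}

// Accepts an optionally signed decimal integer in [p, pe).
// All state lives in locals, so nothing needs cleanup if other
// code throws while the machine is in use.
template <typename Unit>
bool is_integer(const Unit* p, const Unit* pe) {
    State cs = Start;
    for (; p != pe && cs != Error; ++p) {
        cs = kNext[cs][classify(*p)];
    }
    return cs == Digits;
}

// Convenience overload for string literals; drops the trailing NUL.
template <typename Unit, std::size_t N>
bool is_integer(const Unit (&lit)[N]) {
    return is_integer(lit, lit + N - 1);
}

// Small usage example: the same machine runs on char and char16_t input.
inline void demo() {
    assert(is_integer("+7"));
    assert(is_integer(u"-42"));
    assert(!is_integer(u"4x"));
}

}  // namespace sketch
```

Because the input type is a template parameter, switching from char to a 16-bit unit for UCS-2 changes nothing in the table or the loop; only the classification step sees the wider values.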