Token() objects in Lepl
So I'm making my way through the tutorial for Lepl, a parser library for Python, and I can't quite figure out what exactly the difference is between something like `Token(Real())` and just `Real()`. I found the docs on the function, but they are pretty unhelpful.
So, what exactly does the `Token()` class do? Why is it different from regular Lepl classes?
Normally, LEPL operates on a stream of the individual characters in the input. That's simple, but, as you have seen, you'd need lots of redundant rules to ignore e.g. whitespace wherever it is legal but insignificant.
This problem has a common solution: first run the input string through a relatively simple automaton that takes care of this and other distractions. It breaks the input into pieces (e.g. numbers, identifiers, operators, etc.) and strips out the ignored parts (e.g. comments and whitespace). This makes the rest of the parser simpler, but LEPL's default model has no place for this automaton, which is, by the way, called a tokenizer or lexical analyzer (lexer for short).
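To see the difference this makes, here's a minimal sketch contrasting the two approaches (the grammar and the input are my own invention, not from the tutorial):

```python
from lepl import *

# Character level: whitespace must be consumed explicitly wherever it
# may legally appear, cluttering the grammar.
spaces = ~Space()[:]                      # match any run of spaces, discard it
char_sum = Real() & spaces & ~Literal('+') & spaces & Real()
print(char_sum.parse('1 + 2'))            # ['1', '2']

# Token level: a lexer splits the input into tokens first and, by
# default, discards the whitespace between them.
num = Token(Real())
plus = ~Token(r'\+')
tok_sum = num & plus & num
print(tok_sum.parse('1 + 2'))             # ['1', '2']
```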
Each kind of token is usually defined as a regular expression that describes what goes into that token, e.g. `[+-]?[0-9]+` for integers. You can (and sometimes should) do just that with `Token()`, e.g. `Token('a+b+')` gives a parser that consumes as much of the input as the regex matches, then returns it as a single string. For the most part, these parsers work just like all others; most importantly, they can be combined in the same ways. For example, `Token('a+') & Token('b+')` works and is equivalent to the previous one, except that it produces two strings, and `Token('a+') + Token('b+')` is exactly equivalent. So far, they're just a shorter notation for some basic building blocks of some grammars. You can also pass some of LEPL's classes to `Token()`, which converts them into an equivalent regular expression and uses that as the token - e.g. `Token(Literal('ab+'))` is `Token(r'ab\+')`.
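A short illustration of the above (the outputs in the comments are what I'd expect from these definitions, not copied from the tutorial):

```python
from lepl import *

print(Token('a+b+').parse('aabb'))                # ['aabb']    - one string
print((Token('a+') & Token('b+')).parse('aabb'))  # ['aa', 'bb'] - two strings
print((Token('a+') + Token('b+')).parse('aabb'))  # ['aabb']    - joined back into one

# A matcher that LEPL can turn into a regex also works as a token;
# Token(Literal('ab+')) behaves like Token(r'ab\+'):
print(Token(Literal('ab+')).parse('ab+'))         # ['ab+']
```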
The one important difference, and a huge advantage, is that with tokens you can also define patterns that silently match and discard input when no other token would match - the default discards whitespace, which makes ignoring whitespace very easy (while still allowing the parser to require whitespace in some places). The downside is that you have to wrap all non-token matchers in tokens, or write equivalent rules by hand, if they can't be converted automatically.
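For example, a grammar built entirely from tokens never has to mention whitespace at all (again a sketch with made-up names, assuming the default lexer configuration that drops whitespace):

```python
from lepl import *

# 'key : value' pairs; the spaces around ':' are dropped by the lexer,
# so the grammar itself never mentions them.
name = Token('[a-z]+')
num = Token(Real())
pair = name & ~Token(':') & num
print(pair.parse('answer : 42'))   # ['answer', '42']
```

Note that `pair` is built only from tokens; mixing a plain matcher like `Literal(':')` directly into a token-based grammar would not work, which is exactly the downside mentioned above.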