Learning bison: What are context-free grammars and LALR(1)?
I am reading this bison introduction.
I have two questions, and it would be great if someone could help me understand them:
What does the term context-free grammar mean?
From the link above: "Not all context-free languages can be handled by Bison, only those that are LALR(1). In brief, this means that it must be possible to tell how to parse any portion of an input string with just a single token of look-ahead." What does it mean by "possible to tell how to parse any portion of an input string with just a single token of look-ahead"?
A context-free grammar is a description of a set of strings that is strictly more powerful than regular expressions, but still easily handled by a machine. More formally, a context-free grammar consists of four things:
A set of terminal symbols, which are the elements of the strings being produced. For a bison parser, this is typically the set of the tokens produced by the scanner. For a grammar for a natural language like English, this might be the set of all English words.
A set of nonterminal symbols. Intuitively, a nonterminal symbol represents something like a part of speech, such as "noun" or "verb."
A set of productions. Each production says how to replace a nonterminal symbol with some other set of terminals and nonterminals. For example, the production
Variable -> Type Name
says that if we see the nonterminal Variable, we may replace it with the string Type Name.
A start symbol, which is the nonterminal from which the derivation begins.
As an example, consider this simple context-free grammar for C-style function declarations:
Function -> Type ident(Arguments)
Type -> int
Type -> Type *
Arguments -> e (the empty string)
Arguments -> ArgList
ArgList -> Type ident
ArgList -> Type ident, ArgList
Here, the start symbol is Function. Given this grammar, we can produce C-style function declarations by repeatedly picking a nonterminal symbol and replacing it with one of the right-hand sides of a matching production. At each step, the string we've built so far is called a sentential form. For example, here are some different function declarations that can be derived from the above grammar:
Sentential Form Production
-------------------------------------------------------------------
Function Function -> Type ident(Arguments)
Type ident(Arguments) Type -> int
int ident(Arguments) Arguments -> e
int ident()
Sentential Form Production
-------------------------------------------------------------------
Function Function -> Type ident(Arguments)
Type ident(Arguments) Type -> Type*
Type* ident(Arguments) Type -> int
int* ident(Arguments) Arguments -> ArgList
int* ident(ArgList) ArgList -> Type ident, ArgList
int* ident(Type ident, ArgList) ArgList -> Type ident
int* ident(Type ident, Type ident) Type -> Type*
int* ident(Type* ident, Type ident) Type -> Type*
int* ident(Type** ident, Type ident) Type -> int
int* ident(int** ident, Type ident) Type -> int
int* ident(int** ident, int ident)
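The derivation steps in these tables can be replayed mechanically. Here is a minimal Python sketch (an illustration, not bison itself; the dictionary encoding of the grammar is my own) that rewrites the leftmost occurrence of a chosen nonterminal using one of its productions:

```python
# Grammar from the answer above; keys are nonterminals, values list each
# production's right-hand side as a sequence of symbols.
productions = {
    "Function":  [["Type", "ident", "(", "Arguments", ")"]],
    "Type":      [["int"], ["Type", "*"]],
    "Arguments": [[], ["ArgList"]],  # [] stands for e, the empty string
    "ArgList":   [["Type", "ident"], ["Type", "ident", ",", "ArgList"]],
}

def derive(steps, start="Function"):
    """Apply (nonterminal, production-index) steps to the leftmost match."""
    form = [start]
    for nt, idx in steps:
        i = form.index(nt)  # leftmost occurrence of the nonterminal
        form = form[:i] + productions[nt][idx] + form[i + 1:]
    return form

# Replaying the first table: Function -> Type ident(Arguments),
# then Type -> int, then Arguments -> e.
result = derive([("Function", 0), ("Type", 0), ("Arguments", 0)])
print(" ".join(result))  # int ident ( )
```

Each call to `derive` yields the final sentential form; intermediate forms correspond exactly to the rows of the tables above.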
Most programming languages have a structure that can be described by a context-free grammar. The job of the parser is to take a program and a grammar and to determine how that program can be generated by the grammar.
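For concreteness, the toy grammar above might be written in bison's own input syntax roughly as follows (a sketch: the token names INT and IDENT are assumptions, and the C semantic actions a real grammar file would carry are omitted):

```
%token INT IDENT

%%

function  : type IDENT '(' arguments ')' ;

type      : INT
          | type '*'
          ;

arguments : /* empty */
          | arglist
          ;

arglist   : type IDENT
          | type IDENT ',' arglist
          ;
```

By default bison takes the first rule's left-hand side as the start symbol, which mirrors Function being the start symbol here.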
As for LALR(1), unfortunately the formal definition of LALR(1) is not trivial. I just finished teaching a compiler construction course and we were only able to talk about LALR(1) after first spending two lectures discussing related parsing techniques. If you'd like a formal introduction to the material, my slides on bottom-up parsing are available at the course website.
LALR(1) is a type of parsing algorithm called a bottom-up parsing algorithm, which means that it tries to apply the productions of the grammar in reverse order to reduce the program back to the start symbol. For example, let's consider this string, which is generated by the above grammar:
int** ident(int* ident)
In a bottom-up parse, we parse this string by looking at the program one token at a time. Whenever we find something that can be reversed back to some nonterminal, we do so. (To be more precise, LALR(1) only does these sorts of reductions when other criteria are met as well, so that the algorithm has more context; for this example we don't need to worry about that.) Each step in the parse is said to be either a shift or a reduce. A shift means that we look at one more token of the input to gain more information about which reductions to apply. A reduce means that we take some number of tokens and nonterminals and reverse a production to get back to some nonterminal.
Here's a trace of the bottom-up parse of the string:
Workspace                  Input                     Action
---------------------------------------------------------------------------------
                           int** ident(int* ident)   Shift
int                        ** ident(int* ident)      Reduce Type -> int
Type                       ** ident(int* ident)      Shift
Type *                     * ident(int* ident)       Reduce Type -> Type *
Type                       * ident(int* ident)       Shift
Type *                     ident(int* ident)         Reduce Type -> Type *
Type                       ident(int* ident)         Shift
Type ident                 (int* ident)              Shift
Type ident(                int* ident)               Shift
Type ident(int             * ident)                  Reduce Type -> int
Type ident(Type            * ident)                  Shift
Type ident(Type *          ident)                    Reduce Type -> Type *
Type ident(Type            ident)                    Shift
Type ident(Type ident      )                         Reduce ArgList -> Type ident
Type ident(ArgList         )                         Reduce Arguments -> ArgList
Type ident(Arguments       )                         Shift
Type ident(Arguments)                                Reduce Function -> Type ident(Arguments)
Function                                             ACCEPT
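The trace can also be checked mechanically. In the Python sketch below (an illustration: the action script is written out by hand, whereas a real LALR(1) parser would choose each action from its parse tables plus one token of lookahead), a reduce pops the production's right-hand side off the stack and pushes its left-hand side:

```python
def run(tokens, script):
    """Execute a list of actions: "shift", or (lhs, rhs) meaning reduce."""
    stack, rest = [], list(tokens)
    for act in script:
        if act == "shift":
            stack.append(rest.pop(0))       # consume one input token
        else:
            lhs, rhs = act
            assert stack[-len(rhs):] == rhs, f"cannot reduce {lhs} here"
            stack[len(stack) - len(rhs):] = [lhs]  # replace rhs with lhs
    return stack, rest

# int** ident(int* ident), tokenized as the scanner would deliver it.
tokens = ["int", "*", "*", "ident", "(", "int", "*", "ident", ")"]

# The action column of the trace above, transcribed by hand.
script = [
    "shift", ("Type", ["int"]),
    "shift", ("Type", ["Type", "*"]),
    "shift", ("Type", ["Type", "*"]),
    "shift",                               # ident
    "shift",                               # (
    "shift", ("Type", ["int"]),
    "shift", ("Type", ["Type", "*"]),
    "shift",                               # ident
    ("ArgList", ["Type", "ident"]),
    ("Arguments", ["ArgList"]),
    "shift",                               # )
    ("Function", ["Type", "ident", "(", "Arguments", ")"]),
]

stack, rest = run(tokens, script)
print(stack, rest)  # ['Function'] []
```

Ending with only the start symbol on the stack and no remaining input is exactly the ACCEPT state in the trace.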
It's important to know about shifts and reductions because you will invariably encounter shift/reduce conflicts and reduce/reduce conflicts when using bison. These errors mean that the parser generator determined that the parser might get into a state where it can't tell whether to shift or reduce, or which of two reductions it's supposed to perform. Consult the bison manual for more details about how to deal with this.
If you'd like to learn more about context-free grammars and parsing algorithms in general, there is an excellent book entitled Parsing Techniques: A Practical Guide, Second Edition by Grune and Jacobs that has, by far, the best treatment of the material I've ever seen. It covers all sorts of parsing algorithms, including many techniques that are substantially more powerful than LALR(1) that are starting to get wider usage (GLR and Earley, for example). I highly recommend this book - it's the main reason I understand parsing so well!
Hope this helps!
1) In simple terms -- it means that the formatting and context of the code are not important for the individual parts, and you don't have to see how the code is formatted to understand its meaning. As an example, you could write your C program on one line, or with every word on a separate line, and the program would mean the same. Wikipedia has a more formal definition.
2) What LALR(1) means is more complicated, but in simple terms I would describe it as understanding the meaning incrementally by reading one word at a time -- only seeing the next word/symbol. Some spoken languages are famously not LALR(1) -- you only understand whether a sentence is a question or a statement when you get to its end. Some programming languages could be constructed that way as well, but I'm not aware of any, as for practical reasons they all conform to an LALR(1) syntax. I do, however, think there are exceptions where C/C++ needs a two-symbol lookahead to parse statements correctly (hence being LALR(2)), but since I cannot remember off the top of my head what they are, I hope somebody will point them out in a comment.
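As a concrete illustration of the single-token-lookahead idea, consider the function-declaration grammar from the answer above: once the parser has matched Type ident inside an argument list, the one upcoming token is enough to choose its action. This Python sketch is purely illustrative (the function name is mine, not part of any real parser):

```python
def action_after_type_ident(lookahead):
    """Decide the parser's move with "Type ident" on the stack, inside (...)."""
    if lookahead == ")":
        # The argument list is complete.
        return "reduce ArgList -> Type ident"
    if lookahead == ",":
        # More arguments follow: match ArgList -> Type ident , ArgList.
        return "shift"
    return "syntax error"

print(action_after_type_ident(")"))  # reduce ArgList -> Type ident
print(action_after_type_ident(","))  # shift
```

If some construct required seeing two or more tokens ahead to make such a choice, the grammar would fall outside LALR(1).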
In any case, this book is the bible when it comes to understanding compilers and parsers.