Token Attributes
I have written a simple lexical analyzer, and I understand the need to give each recognized token an attribute. Here is what I have so far:
public sealed class Token
{
    public enum TokenClass
    {
        Identifier,
        StringLiteral,
        NumberLiteral,
        Operator,
        PunctuationSeparator,
        Bracket,
        Parenthesis
    }

    public TokenClass Class { get; internal set; }
    public String Value { get; internal set; }
}
In the lexer I enqueue tokens, setting their value and class. But what about attributes? How should I design this feature around my existing token class?
The first thought that came to mind was:
- Declare private abstract classes for the "ambiguous entities" inside the token class (by that I mean that a Number could be an Integer, a Real, and so on);
- Then declare derived classes, e.g.
public class Comma : PunctuationSeparator {}
- Add a property
Object Attribute { get; private set; }
- Create a method like
private void ApplyAttribute()
- Call ApplyAttribute() once the token is instantiated and its properties are set, and use something like this inside it:
switch (this.Class)
{
    case TokenClass.NumberLiteral:
    {
        int parsed;
        this.Attribute = Int32.TryParse(this.Value, out parsed) ? (Object) new Integer() : new Real();
        break;
    }
}
In the parser it would then be easy to write something like
if (CurToken.Attribute is Integer)
The one thing that stops me from doing it this way is the number of classes I would have to create. Is this solution acceptable?
The attributes I'd use for a token? Probably something along the lines of
public class Token
{
    public TokenType Type { get; private set; }
    public string Text { get; private set; }
    public int LineNumber { get; private set; }
    public int Column { get; private set; }
}

public enum TokenType
{
    Keyword = 1,
    Integer,
    String,
    Whitespace,
    Comment,
    ...
}
I disagree, though, with the previous poster regarding conversion of the token's text into a 'value'. IMHO, that is the domain of the parser and the nodes of the parse tree. Until the parser has placed the tokens in context with respect to the grammar, a token is just a piece of text with a label attached to it. The lexical analyzer doesn't know (and shouldn't care) what's happening downstream: for all it knows, the tool is pretty-printing the source text (in which case you want to leave the individual tokens alone).
You might want to take a look at Terence Parr's books:
- Language Implementation Patterns: Create Your Own Domain-Specific and General Programming Languages
- The Definitive ANTLR Reference: Building Domain-Specific Languages
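To make the point about conversion concrete, here is a minimal sketch of the parser, not the lexer, turning token text into a value when it builds a tree node. The Token constructor, LiteralNode, and LiteralParser are hypothetical names made up for illustration, and I'm assuming that a String token's text still includes its surrounding double quotes.

```csharp
using System;
using System.Globalization;

public enum TokenType { Keyword = 1, Integer, String }

// The token stays "a piece of text with a label attached to it".
public sealed class Token
{
    public Token(TokenType type, string text, int lineNumber, int column)
    {
        Type = type;
        Text = text;
        LineNumber = lineNumber;
        Column = column;
    }

    public TokenType Type { get; private set; }
    public string Text { get; private set; }
    public int LineNumber { get; private set; }
    public int Column { get; private set; }
}

// Hypothetical parse-tree node: this is where the converted value lives.
public sealed class LiteralNode
{
    public LiteralNode(object value) { Value = value; }
    public object Value { get; private set; }
}

public static class LiteralParser
{
    // Called by the parser once the grammar says "a literal goes here".
    public static LiteralNode ParseLiteral(Token token)
    {
        switch (token.Type)
        {
            case TokenType.Integer:
                return new LiteralNode(int.Parse(token.Text, CultureInfo.InvariantCulture));
            case TokenType.String:
                // Assumption: the lexer kept the quotes in Text.
                return new LiteralNode(token.Text.Trim('"'));
            default:
                throw new FormatException(
                    "Unexpected token at line " + token.LineNumber + ", column " + token.Column);
        }
    }
}
```

The lexer's output is untouched by any of this, so a pretty-printer could consume the same token stream and emit Text verbatim.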
Instead of
public String Value { get; internal set; }
just use
public object Value { get; internal set; }
and then store integer or floating-point literals in there as actual int or double values. Then in your parser you can just say
if (token.Value == null)
{
    // blah
}
else if (token.Value is int)
{
    // work with (int) token.Value
}
else if (token.Value is double)
{
    // work with (double) token.Value
}
else if (token.Value is string)
{
    // work with (string) token.Value
}
or alternatively:
int? integer;
double? floating;
string str;
if (token.Value == null)
{
    // blah
}
else if ((integer = token.Value as int?) != null)
{
    // work with integer.Value
}
else if ((floating = token.Value as double?) != null)
{
    // work with floating.Value
}
else if ((str = token.Value as string) != null)
{
    // work with str
}
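For the lexer side of this, something along these lines could fill in Value. It is only a rough sketch: LiteralValues and FromNumberText are hypothetical names, and the choice of invariant culture and of int before double is my assumption, not something your lexer has to follow.

```csharp
using System;
using System.Globalization;

public static class LiteralValues
{
    // Returns a boxed int if the text is a plain integer, a boxed double if it
    // is any other number, and the original text if it is not a number at all.
    public static object FromNumberText(string text)
    {
        int i;
        if (int.TryParse(text, NumberStyles.Integer, CultureInfo.InvariantCulture, out i))
        {
            return i;
        }

        double d;
        if (double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out d))
        {
            return d;
        }

        return text;
    }
}
```

The lexer would then assign the result to token.Value when it recognizes a number literal, and the is/as checks above pick the right branch. Note that an integer too large for int falls through to the double branch here, which may or may not be what you want.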