Writing a Tokenizer class in C# for a BASIC Interpreter
For a bit of fun, I'm attempting to write a BASIC interpreter in C#. Below is my Tokenizer class (with just a few keywords). I'd welcome any suggestions or comments. Code clarity is more important to me than efficiency. Many thanks.
    class Tokenizer
    {
        const string Operators = "+-/*%<>=&|";
        private List<string> Keywords = new List<string>{"LET", "DIM", "PRINT", "REM"};
        private List<string> tokens = new List<string>();
        private List<string> tokenTypes = new List<string>();
        private int tokenIndex;

        // Turn command string into tokens
        public void Tokenize(string cmdLine)
        {
            string token = "";
            char lastc = ' ';
            bool inString = false;
            tokens.Clear();
            tokenTypes.Clear();

            // Step through line and split into tokens
            foreach (char c in cmdLine)
            {
                if (c == '"') inString = !inString;
                if (!inString)
                {
                    if (IsOperator(lastc)) AddToken(ref token);
                    if (IsWhitespace(c)) AddToken(ref token);
                    if (IsOperator(c)) AddToken(ref token);
                    if (IsNumber(c) && !IsNumber(lastc)) AddToken(ref token);
                    if (!IsWhitespace(c)) token += c;
                }
                else
                    token += c;
                lastc = c;
            }

            // Add last token
            AddToken(ref token);
            tokenIndex = 0;
        }

        public string Token()
        {
            return tokens[tokenIndex];
        }

        public string TokenType()
        {
            return tokenTypes[tokenIndex];
        }

        public void NextToken()
        {
            tokenIndex++;
        }

        public bool TokensLeft()
        {
            return tokenIndex < tokens.Count;
        }

        // Add a token to the collection
        private void AddToken(ref string token)
        {
            if (token.Trim() != "")
            {
                // Determine token type
                string tokenType = "Identifier";
                if (IsOperator(token[0])) tokenType = "Operator";
                if (IsNumber(token[0])) tokenType = "Number";
                if (token[0] == '"') tokenType = "String";
                if (Keywords.Contains(token.ToUpper())) tokenType = "Keyword";
                tokens.Add(token);
                tokenTypes.Add(tokenType);
                token = "";
            }
        }

        private bool IsWhitespace(char c)
        {
            return (c.ToString() != c.ToString().Trim());
        }

        private bool IsOperator(char c)
        {
            return Operators.Contains(c);
        }

        private bool IsNumber(char c)
        {
            return Char.IsNumber(c);
        }
    }
You rarely want to write parser code like that by hand. Learning a good parser generator tool such as ANTLR is a good investment of your time if you're going to parse computer languages for more than fun or a coding exercise. That said, if you really want to do this manually, here are some things to think about:
- Use a StringBuilder instead of a string to accumulate each token (token += c allocates a new string for every character)
- Just checking for quotes is not enough. What about escaped quotes?
- How will you handle floating point numbers (in all formats)?
- How will you handle identifiers that include numbers?
Those are a few of the issues you'll run into. Again, I really recommend learning a parser generator tool; it makes this kind of thing much more fun (not to mention correct and efficient).
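If you do go the manual route, here is a rough sketch of how the points in the list might be handled with an index-based scanner. It is only an illustration, not a drop-in replacement for the class in the question: the names ScannerSketch and Scan are made up, it assumes a quote inside a string is escaped by doubling it (""), it accepts only simple decimal numbers with an optional fraction and exponent, and it leaves out keyword lookup entirely.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of the original Tokenizer.
static class ScannerSketch
{
    // Splits one line into (Type, Text) pairs.
    public static List<(string Type, string Text)> Scan(string line)
    {
        var result = new List<(string Type, string Text)>();
        int i = 0;
        while (i < line.Length)
        {
            char c = line[i];
            if (char.IsWhiteSpace(c)) { i++; continue; }

            var text = new StringBuilder(); // avoids repeated string concatenation
            if (c == '"')
            {
                // String literal. Assumption: "" inside a string means one literal quote.
                // An unterminated string simply runs to the end of the line.
                i++; // skip the opening quote
                while (i < line.Length)
                {
                    if (line[i] == '"')
                    {
                        if (i + 1 < line.Length && line[i + 1] == '"')
                        {
                            text.Append('"');
                            i += 2;
                            continue;
                        }
                        i++; // skip the closing quote
                        break;
                    }
                    text.Append(line[i++]);
                }
                result.Add(("String", text.ToString()));
            }
            else if (char.IsDigit(c) || (c == '.' && i + 1 < line.Length && char.IsDigit(line[i + 1])))
            {
                // Number: digits, optional fraction, optional exponent.
                while (i < line.Length && char.IsDigit(line[i])) text.Append(line[i++]);
                if (i < line.Length && line[i] == '.')
                {
                    text.Append(line[i++]);
                    while (i < line.Length && char.IsDigit(line[i])) text.Append(line[i++]);
                }
                if (i < line.Length && (line[i] == 'e' || line[i] == 'E'))
                {
                    // Only treat 'e' as an exponent if digits actually follow it.
                    int j = i + 1;
                    if (j < line.Length && (line[j] == '+' || line[j] == '-')) j++;
                    if (j < line.Length && char.IsDigit(line[j]))
                    {
                        text.Append(line, i, j - i);
                        i = j;
                        while (i < line.Length && char.IsDigit(line[i])) text.Append(line[i++]);
                    }
                }
                result.Add(("Number", text.ToString()));
            }
            else if (char.IsLetter(c))
            {
                // Identifier: starts with a letter, may contain digits after that.
                while (i < line.Length && char.IsLetterOrDigit(line[i])) text.Append(line[i++]);
                result.Add(("Identifier", text.ToString()));
            }
            else
            {
                // Anything else is treated as a single-character operator.
                result.Add(("Operator", c.ToString()));
                i++;
            }
        }
        return result;
    }
}
```

Calling ScannerSketch.Scan on the line LET x1 = 1.5e3 + "say ""hi""" should keep x1 and 1.5e3 as single tokens and give the string token the text say "hi". Deciding which kind of token starts at the current position, then consuming it in one go, tends to be easier to follow than reacting to every character against the previous one, but even this small sketch shows how quickly the special cases pile up.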