Producing squiggles using your abstract-syntax-tree
The situation:
I have crafted a scanner, parser and various AST classes for a little used programming language, a hobby project of mine. The parser, with the help of the scanner, builds a heterogenous AST which I do some manipulations on. In the past I have created a plugins/add-ins for some IDEs for syntax highlighting and some other elements.
The problem lies in the errors: the parser generates some and has access to the tokens that constitute a statement. Some errors only arise later however, such as being unable to resolve an identifier. I would like to display squiggles under such identifiers or other faulty tokens. Not just that, I like the ability to manipulate my AST nodes without losing all comments, spacing and so forth in my original document.
When creating a new Statement in my AST I can easily add the tokens that make up this statement as children. But ...
The question:
If reasonably doable I wish to include support for displaying squiggles. This requires a statement to be aware of where开发者_JAVA技巧 the tokens that make up the statement are positioned. Unfortunately, a statement has variations that sometimes include more Tokens, sometimes less tokens. If I were making a read-only AST, this would not be a problem. However, I wish my AST to be read-write for refactoring purposes! This means that altering a statement in the AST essentially means adding Tokens (the children of a Statement) and thus the Statement class should be capable of reparsing itself.
This would polute the AST with parsing code and no longer maintaining seperation of concerns!
Technical details:
One alternative is to turn the AssignmentStatement into a factory, getting a set of tokens, producing an instance of the statement and constantly have it be aware of its own tokens.
An AST-node for assignments in my case is basically be like this at the moment:
A hierarchical sample AST could be:
AssemblyDeclaration
.. Statement ..
.. Statement ..
ClassDeclaration
.. Token .. // one or more that make up the entire class statement
.. Token .. // one or more that make up the entire class statement
Statement(s)
.. Token .. // Which have their own tokens that make up the statement .. and possibly have sub-nodes of their own such as Expressions which have -their- own Tokens that comprise it.
A conceptual idea of the base ast node, the parent for everything in the syntax tree, no matter if it is a class declaration, statement or token.
public abstract class BaseAstNode : IList<BaseAstNode>
{
... implementation of IList<BaseAstNode>
... implementation of Visitor Pattern
... implementation of Clonable
}
public sealed class AssignmentStatement : BaseAstNode
{
public Expression Expression { get; set; }; // Setting this will alter the Tokens (children!) of this node, possibly even ADD Tokens!
public TypeReference Target { get; set; }
}
public sealed class PrimitiveNumberExpression : Expression // is a BaseAstNode
{
public int Value { get; set; } // Setting this will alter the Tokens (children!) of this node!
}
public abstract class Token : BaseAstNode
{
public Layers Layer { get; set; }
public TokenType TokenType { get; set; }
public int Column { get; set; }
public int Line { get; set; }
public int Position { get; set; }
public abstract int Length { get; set; }
public virtual string Value { get; set; }
public override string ToString(){}
}
How do others solve this problem? Is this the right way?
If you can get the code for the statement where the error occurred, you should be able to re-parse the individual statement, this time tracking the location of the tokens and noting those that are invalid (since you didn't have the locations of the tokens saved the first time around). At least you don't have to parse the whole document over, just one line... perhaps re-using some of the parsing code you already have.
Alternatively, do store the location of every token (start and length in the document) so you can easily access the position of that token when you need to refer to its source. Why not do that?
Some possible optimizations, if you can re-generate the token text from the token object, would be to not store the token length, or just search for that text within the line, assuming you want to do the same thing to every occurrence of that token on the line (or all lines).
Edit: Now that you have the linked the tokens to the original source code from which they were parsed, you'd need to link the statements to the tokens from which they were parsed. There are two ways you could do this:
- Add a List<Token> member to the BaseAstNode. OR
- Add specific Token members to each derived statement class so you know exactly which token links to which part of the statement.
Now you have a chain all the way from the statement to the source code so you can identify the source code associated with each part of each statement.
Edit 2: If you are trying to figure out how the actual parsing would process tokens into BaseAstNode-derived objects, I have not yet considered that question (it wasn't very clear that this was part of the question). Since I have not done this before, my suggestion may not be the best on the first attempt. But hopefully it can be refined as necessary. My first thoughts lead me to the System.Xml namespace, which also implement a parser. Perhaps some of your design can be modeled after the XML parsers and related classes. XmlNode is probably similar to your BaesAstNode class. XmlElement is probably analogous to your Token class. XmlDocument is a specialization of XmlNode that handles all the parsing in one LoadXml function, which populates the object with children.
So following the Xml namespace's example, you could make ParseCode be a member of ClassDeclaration that populates the object with all the appropriate child objects. It could accept a string or an ordered collection of tokens as input, and add derived instances of BaseAstNode to an ordered collection of BaseAstNode objects stored in the class. But not all of the parsing could would need to be in one function. As you are parsing text or tokens, I assume you would start out in the ParseCode function, pushing parsed tokens onto a stack or queue until you have parsed enough tokens to uniquely identify what kind of statement you are parsing, then pass those queued tokens and the current state of the parser to an appropriate derived parser on one of the statement classes. When the derived class finishes parsing the tokens that define the statement, the tokens would be added to the statement, then the calling code would add the parsed statement to the parent collection of statements.
精彩评论