How to efficiently implement an immutable graph of heterogeneous immutable objects in C++?
I am writing a programming language text parser, out of curiosity. Say I want to define an immutable (at runtime) graph of tokens as vertices/nodes. These are naturally of different types - some tokens are keywords, some are identifiers, etc. However, they all share the common trait that each token in the graph points to another. This property lets the parser know what may follow a particular token - and so the graph defines the formal grammar of the language. My problem is that I stopped using C++ on a daily basis some years ago and have used a lot of higher-level languages since then, so my head is completely fragmented with regards to heap allocation, stack allocation and such. Alas, my C++ is rusty.
Still, I would like to climb the steep hill at once and set myself the goal of defining this graph in this imperative language in the most performant way. For instance, I want to avoid allocating each token object separately on the heap using 'new', because I think that if I allocate the entire graph of tokens back-to-back, so to speak (in a linear fashion, like elements in an array), this would benefit performance per the locality-of-reference principle - I mean, when the entire graph is compacted to take up minimal space along a 'line' in memory, rather than having all its token objects at random locations, that is a plus? Anyway, as you can see, this is a bit of a very open question.
class token
{
};

class word : public token
{
    const char* chars;
public:
    word(const char* s) : chars(s)
    {
    }
};

class ident : public token
{
    /// haven't thought about these details yet
};

template<int N> class composite_token : public token
{
    token tokens[N];   // note: stored by value, so derived tokens would be sliced
};

class graph
{
    token* p_root_token;
};
The immediate question is: what would be the procedure to create this graph object? It's immutable and its structure is known at compile time, which is why I can and want to avoid copying stuff by value and so on - it should be possible to compose this graph out of literals? I hope I am making sense here... (wouldn't be the first time I didn't.) The graph will be used by the parser at runtime as part of a compiler. And just because this is C++, I would be happy with a C solution as well. Thank you very much in advance.
My C++ is rusty as well, so I probably don't know the best solution for this. But since nobody else stepped forward...
You are right in that allocating all nodes in one block would give you the best locality. However, if you dynamically allocate the graph at program start, chances are that your heap allocations will also cluster together closely.
To allocate all nodes in a single memory block, two possibilities come to my mind: create and populate a std::vector<> at startup (with the drawback that the graph information then exists twice in memory), or use a static array initializer: Node graph[] = { ... };
For either approach, the biggest obstacle is that you want your graph to contain heterogeneous objects. One obvious solution is "Don't": make your node a superset of all possible fields and distinguish the types with an explicit 'type' member, as in the sketch below.
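A minimal sketch of that "superset node" idea combined with a static array initializer could look like this (NodeType, Node and grammar are made-up names for illustration, not a prescribed design):

// Hypothetical sketch: one "superset" node type, distinguished by an enum,
// with successor links expressed as indices into the same static array.
enum NodeType { KEYWORD, IDENT, NUMBER, END };

struct Node
{
    NodeType    type;
    const char* text;     // only meaningful for some node types
    int         next[4];  // indices of allowed successors, -1 = unused
};

// The whole grammar graph lives in one contiguous, statically initialized array.
static const Node grammar[] = {
    { KEYWORD, "if",    { 1, -1, -1, -1 } },  // node 0 may be followed by node 1
    { IDENT,   nullptr, { 2, -1, -1, -1 } },  // node 1 may be followed by node 2
    { END,     nullptr, { -1, -1, -1, -1 } }  // node 2: terminal
};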
If you want to keep the various node classes, you will have to use multiple arrays/vectors: one for each type.
Either way, the connections between the nodes will have to be initially defined in terms of array indices (Node[3] is followed by Node[10]). For better parsing performance, you could create direct object pointers at program startup based on these indices, of course.
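Building on the sketch above, the index-to-pointer conversion could be done once at startup along these lines (again only a sketch; ResolvedNode and resolve_links are made-up names):

#include <cstddef>
#include <vector>

// Resolve the index-based links of the static `grammar` array (see the sketch
// above) into direct node pointers, once, at program startup.
struct ResolvedNode
{
    const Node*              node;
    std::vector<const Node*> successors;
};

static std::vector<ResolvedNode> resolve_links()
{
    const std::size_t count = sizeof(grammar) / sizeof(grammar[0]);
    std::vector<ResolvedNode> resolved(count);
    for (std::size_t i = 0; i < count; ++i)
    {
        resolved[i].node = &grammar[i];
        for (int idx : grammar[i].next)      // follow each stored index
            if (idx >= 0)
                resolved[i].successors.push_back(&grammar[idx]);
    }
    return resolved;
}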
I would not put literal strings into any node ('word' in your case): the recognition of keywords, identifiers and other lexical elements should be done in a lexer module separate from the parser. I think it would also help to distinguish, in terminology, between the tokens the lexer generates from the program's input and the grammar graph nodes your program uses to parse that input.
I hope this helps.
I don't see how you will define a "graph" of tokens that defines the syntax of any practical programming language, especially if the relation between tokens is "allowed-to-follow".
The usual way to represent the grammar of a programming language is Backus-Naur Form (BNF), or an extended version of it termed EBNF.
If you wanted to represent an EBNF ("as an immutable graph"), this SO answer discusses how to do that in C#. The ideas have direct analogs in C++.
The bad news is that most parsing engines can't use the EBNF directly, because it is simply too inefficient in practice. It is hard to build an efficient parser from the direct representation of the grammar rules; this is why people invented parser generators. So the need to put these rules into a memory structure at all, let alone an "efficient" one, is unclear unless you intend to write a parser generator.
Finally, even if you do pack the grammar information somehow optimally, it probably won't make an ounce of difference in actual performance. Most of a parser's time is spent grouping characters into lexemes, sometimes even to the point of just doing blank suppression.
I don't think the many small token allocations will be a bottleneck; if they do turn out to be, you can always switch to a memory pool.
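For instance, such a pool could be as simple as a bump allocator over one pre-sized buffer (a rough sketch only; a real pool would also need to run destructors and handle growth, and token_pool is just an illustrative name):

#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Minimal bump-pointer pool: objects are constructed back-to-back in one
// buffer, so they end up close together in memory. The whole pool is
// released at once; individual objects are never freed or destroyed.
class token_pool
{
    std::vector<unsigned char> buffer_;
    std::size_t                offset_ = 0;
public:
    explicit token_pool(std::size_t bytes) : buffer_(bytes) {}

    template <typename T, typename... Args>
    T* create(Args&&... args)
    {
        // Round the offset up to T's alignment (buffer_ itself is assumed
        // to be sufficiently aligned for the token types used).
        std::size_t aligned = (offset_ + alignof(T) - 1) & ~(alignof(T) - 1);
        if (aligned + sizeof(T) > buffer_.size())
            throw std::bad_alloc();
        void* p = buffer_.data() + aligned;
        offset_ = aligned + sizeof(T);
        return new (p) T(std::forward<Args>(args)...);  // placement new
    }
};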
On to the problem: since all tokens hold similar data (a pointer to the next token, and perhaps some enum value saying what kind of token we are dealing with), you can put that similar data in one std::vector. This gives contiguous data in memory, which is very efficient to loop over.
While looping, you retrieve whatever information you need. I bet the tokens themselves would ideally only contain "actions" (member functions), such as: if the previous and next tokens are numbers, and I am a plus sign, add the numbers together.
So the data is stored in one central place, while the tokens (which may not contain much data themselves) operate on that central data. This is essentially a data-oriented design.
The vector could look like:
#include <cstddef>
#include <vector>

class token;   // forward declaration so TokenData can point to tokens

enum token_id { NUMBER, PLUS /* , ... */ };   // some enum?

struct TokenData
{
    token *previous, *current, *next;
    token_id id;
    // ... more data that is similar
};

std::vector<TokenData> token_data;

class token
{
    std::vector<TokenData> *token_data;
    std::size_t index;
public:
    TokenData &data()             { return (*token_data)[index]; }
    const TokenData &data() const { return (*token_data)[index]; }

    virtual ~token() = default;
    virtual void do_work() = 0;   // each concrete token supplies its own action
};

// class plus_sign : public token
// {
//     void do_work() override
//     {
//         if (data().previous->data().id == NUMBER && data().next->data().id == NUMBER)
//             /* add the numbers together */;
//     }
// };

for (std::size_t i = 0; i < token_data.size(); i++)
{
    token_data[i].current->do_work();
}
It's an idea.