I have to parse a complicated string format. Is implementing an automaton a sensible approach?
I am currently struggling with a particularly obnoxious string format that I have to parse. The strings can contain substrings that denote a variable property that has to be resolved. Imagine something like "ThisExampleStringContainsA[VARIABLE_PROPERTY]"
. Also, these properties can be arbitrarily nested and also they can have different meanings, dependending on context. If [VARIABLE_PROPERTY]
is in fact not a valid name of a variable (which of course has to be decided at runtime), it just becomes a normal part of the entire string and remains unchanged and verbatim. Followingly, there are no invalid strings, as the number of opening square brackets does not need to match the number of closing brackets! This]Is[A[Valid]]][ExampleToo!
. There are more rules, but this will give you an idea.
So, at the moment I am unsure how to approach this. My first tries have ended in an incredible mess of ifs and elses and I noticed more and more that the solution should propably incorporate some sort of state concept. Now, I am thinking more and more about using an automaton to do this. However, I have encountered automatons only as pure theoretical constructs. I never came across an actual implementation. Furthermore, automatons are traditionally used to validate a word, i.e. determining if it belongs to a formally defined language. Needless to say, it is difficult for me to come up with a formal definition of that language.
How would you approach this? Do you think actually implementing an automaton is a sane approach? How would you model this from an OO design point of view? The project is in C#, if that makes any difference. Would you suggest something entirely different?
/Edit: My description may have been a bit misleading, here are some more details: The problem for me is to find the properties in the right order (from innermost to outermost). Once you have identified the next property to resolve, the actual substitution with its final value is relatively easy.
Let's take the example from above and I 'll give you a step by step example of what should happen.
The full input string is: This]Is[A[Valid]]][ExampleToo!
The first closing bracket and the last opening bracket are just normal characters, as they don't enclose anything. The same goes for all characters that are not between a matching bracket pair. That leaves us with the part [A[Valid]]]
. The innermost property has to be resolved first, that would be [Valid]
. The brackets just enclose the property identifying string, so Valid
is the name of the property we are about to resolve. Let's say, this string does in fact identify a property and it gets replaced with its actual value, let's say Foo
. The identifying string including the brackets gets replaced, so [Valid]
becomes Foo
.
Now, we have to look at [AFoo]]
. Let's pretend AFoo
does NOT identify a property, that leaves the substring unchanged (including the brackets).
Finally, the second closing bracket after AFoo
has no matching opening bracket and is therefore also just a character.
After processing 开发者_StackOverflowis complete, the entire string would read: This]Is[AFoo]][ExampleToo!
I hope this example makes things a bit more clear. Please keep in mind, that I have simplified the string format here! This is just to give you an idea, what difficulties I am facing. I don't expect working code, I am looking for answers that give me ideas on how to approach the problem. Since this parsing has to be done for many thousands of strings the solution must have a somewhat reasonable performance.
How about plain old recursion? Seems like a good fit here.
精彩评论