开发者

Parsing Tree-structure into Relational-style data-store

Would someone be able to help with how I could implement this, or at least the algorithm to use for this.

What I am trying to do is parse a a hierarchical/tree structure file into a relation store. I will explain further below, with an example.

This is a sample source file, just a simple/non-realistic example for purposes of this question.

<title text=“title1">
    <comment id=“comment1">
        <data> this is part of comment one</data>
        <data> this is some more of comment one</data>
    </comment>
    <comment id=“comment2”>
        <data> this is part of comment two</data>
        <data> this is some more of comment two</data>
        <data> this is even some more of comment two</data>
    </comment>
</title>

So the main thing to note here is that the number of <comment>, and the number of <data> elements for each comment may be arbitrary. So given the above, I would want to transform into something looking like:

title     |   comment     |      data
------------------------------------------------------------------------
title1       comment1            this is some part of comment one
title1       comment1            this is some more of comment one
title1       comment2            this is part of comment two       
title1       comment2            this is some more of comment two
title1       comment2            this is even some more of comment two 

In order to make this happen, lets say I can have specified the relational schema in the following manner, using an xpath expression that can be evaluated on the source file.

attribute1: title   =  /title/@title
attribute2: comment =  /tit开发者_StackOverflow中文版le/comment/@id
attribute3: data    =  /title/comment/data/text()

Suggested Data-structures:

  • ResultSet is a List<Map<String,String>> (where: each map represents a single row)
  • Schema is a Map<String,String> (where: we map attribute-name --> path expression)
  • Source file, some DOM Document


I'm not sure whether you're asking how to implement the XML parser itself or how, given a parse tree for the XML, how to flatten it into a hierarchical structure. I'm guessing that you're looking at the latter of these now (there are many good XML parsers out there and I doubt that's the bottleneck), so I'll answer that here. Let me know if you're actually interested in the XML parsing detail and I can update the answer.

I believe that the way you want to think about this is with a recursive descent over the tree. The idea is as follows: your naming system consists of the concatenation of all the nodes above you in the tree followed by your own name. Given that, you could run a recursive DFS over the tree using something like this:

FlattenXML(XMLDocument x) {
    for each top-level XML node t:
        RecFlattenTree(t, "");
}

RecFlattenTree(Tree t, String prefix) {
    if t is a leaf with data d:
       update the master table by adding (prefix, d) to the list of entries
    else
       for each child c of t, whose name is x:
           RecFlattenTree(c, prefix + "/" + x)
}

For example, if you were to trace this over the XML document you had up top, it might go something like this:

RecFlattenTree(title1, "/title1")
    RecFlattenTree(comment1, "/title1/comment1")
        RecFlattenTree(data node 1 , "/title1/comment1")
             Add /title1/comment1/data, value = "this is some part of comment one"
        RecFlattenTree(data node 2, "/title1/comment1")
             Add /title1/comment2/data, value = "this is some more of comment one"
    RecFlattenTree(comment2, "/title1/comment2")
        RecFlattenTree(data node 1 , "/title1/comment2")
             Add /title1/comment2/data, value = "this is part of comment two"
        RecFlattenTree(data node 2, "/title1/comment2")
             Add /title1/comment2/data, value = "this is more of comment two"
        RecFlattenTree(data node 3, "/title1/comment2")
             Add /title1/comment2/data, value = "this is even more of comment two"

Which ends up generating the list

/title1/comment1/data, value = "this is some part of comment one"
/title1/comment1/data, value = "this is some more of comment one"
/title1/comment1/data, value = "this is part of comment two"
/title1/comment1/data, value = "this is more of comment two"
/title1/comment1/data, value = "this is even more of comment two"

Which is exactly what you want.

Hope this helps! Let me know if I misinterpreted your question!

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜