Extracting Info from Plain Text and Writing to XML Using DOM

2023-03-18 22:33

Currently, I'm designing some format conversion tools in the area of glycobiology. The conversion goes from a plain text file to an XML format that is standard in the field. Most of the time, the data we get holds the information of interest in plain text like the example below (the actual file has all of this on one line). Reading and splitting this text to get the information is straightforward, if not exactly intuitive, but generating the XML is where the problem is.

[][b-D-GlcpNAc]
    {[(4+1)][b-D-GlcpNAc]
        {[(4+1)][b-D-Manp]
            {[(3+1)][a-D-Manp]
                {[(2+1)][a-D-Manp]{}
            }
        [(6+1)][a-D-Manp]
            {[(3+1)][a-D-Manp]{}
            [(6+1)][a-D-Manp]{}
        }
    }
}

How to interpret this:

Everything of the form \w-\w-\w+ is a sugar that is linked to another one. A { after a sugar opens the list of sugars linked to it.
(4+1), (3+1) and so on indicate which carbons form the bond: the first number is the carbon on the preceding sugar, the second is the carbon on the succeeding one. So (4+1) means the 4th carbon on the preceding sugar links to the 1st carbon on the succeeding sugar.
{} indicates that no additional sugar is linked to that sugar.
A lone } just closes that tier.
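
The rules above can be sketched as a small tokenizer. This is only an illustration, not part of the original tooling: the name read_sugars and the regular expression are assumptions, and the sample string is simply the first three sugars of the data above written on one line.

```python
import re

# One sugar is two bracketed fields: "[linkage][name]".
# The linkage field is empty for the root sugar, otherwise "(parent+child)".
SUGAR = re.compile(r"\[(?:\((\d+)\+(\d+)\))?\]\[([^\]]+)\]")


def read_sugars(text):
    """Return (parent_carbon, child_carbon, name) for each sugar, in reading order."""
    sugars = []
    for match in SUGAR.finditer(text):
        parent_c, child_c, name = match.groups()
        sugars.append((
            int(parent_c) if parent_c else None,  # None for the root sugar
            int(child_c) if child_c else None,
            name,
        ))
    return sugars


# Prefix of the data above, joined onto one line (braces are ignored here).
sample = "[][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
print(read_sugars(sample))
```

This only recovers the sugars and their carbon numbers; which sugar each one is attached to still has to come from the braces.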

You can probably read the XML and figure out how the linkages work. But if you guys would prefer a more detailed explanation, just ask.

What the XML should look like is shown below.

<?xml version="1.0" encoding="UTF-8"?>
<GlydeII>
    <molecule subtype="glycan" id="From_GlycoCT_Translation">
            <residue subtype="base_type" partid="1" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=b-dglc-HEX-1:5" />
            <residue subtype="substituent" partid="2" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=n-acetyl" />
            <residue subtype="base_type" partid="3" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=b-dglc-HEX-1:5" />
            <residue subtype="substituent" partid="4" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=n-acetyl" />
            <residue subtype="base_type" partid="5" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=b-dman-HEX-1:5" />
            <residue subtype="base_type" partid="6" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue subtype="base_type" partid="7" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue subtype="base_type" partid="8" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue subtype="base_type" partid="9" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue subtype="base_type" partid="10" ref="http://www.monosaccharideDB.org/GLYDE-II.jsp?G=a-dman-HEX-1:5" />
            <residue_link from="2" to="1">
                <atom_link from="N1H" to="C2" to_replace="O2" bond_order="1" />
            </residue_link>
            <residue_link from="3" to="1">
                <atom_link from="C1" to="O4" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="4" to="3">
                <atom_link from="N1H" to="C2" to_replace="O2" bond_order="1" />
            </residue_link>
            <residue_link from="5" to="3">
                <atom_link from="C1" to="O4" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="6" to="5">
                <atom_link from="C1" to="O3" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="7" to="6">
                <atom_link from="C1" to="O2" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="8" to="5">
                <atom_link from="C1" to="O6" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="9" to="8">
                <atom_link from="C1" to="O3" from_replace="O1" bond_order="1" />
            </residue_link>
            <residue_link from="10" to="8">
                <atom_link from="C1" to="O6" from_replace="O1" bond_order="1" />
            </residue_link>
    </molecule>
</GlydeII>

So far I've been able to extract all the residue fields and write them to XML without trouble. But I'm having trouble even writing pseudocode for the residue_link fields. Any help or ideas on how to go about adding the linkage information to the XML would be appreciated.

Okay! Cool problem, it hurts my brain in a good way.

First... for my sanity I tabbed your raw data into a way that makes sense:

[][b-D-GlcpNAc] {
    [(4+1)][b-D-GlcpNAc] {
        [(4+1)][b-D-Manp] {
            [(3+1)][a-D-Manp] {
                [(2+1)][a-D-Manp] { }
            }
            [(6+1)][a-D-Manp] {
                [(3+1)][a-D-Manp] { }
                [(6+1)][a-D-Manp] { }   
            }
        }
    }

I think the key to this is figuring out what the parent/child pairs are, and for that you want to programmatically keep track of what nesting level you're on.

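Pseudocode, written out as runnable Python:

import re

def sugars_with_depth(text):
    """Return (depth, linkage, name) for each sugar, in reading order."""
    sugar = re.compile(r"\[([^\]]*)\]\[([^\]]*)\]")  # one sugar: [linkage][name]
    hierarchy, pos, found = 0, 0, []
    while pos < len(text):
        match = sugar.match(text, pos)
        if match:
            found.append((hierarchy, match.group(1), match.group(2)))
            pos = match.end()          # skip past both bracketed fields
        else:
            if text[pos] == "{":
                hierarchy += 1         # entering the previous sugar's children
            elif text[pos] == "}":
                hierarchy -= 1         # closing that tier
            pos += 1
    return found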

You'd also want to keep track of which sugar is the previous "parent" sugar.
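
One way to track the parent is a stack: push the most recent sugar on every { and pop on every }, so the top of the stack is always the sugar the next one attaches to. Below is a rough sketch using xml.dom.minidom, not a definitive implementation. The function names are made up for illustration. It assumes that a name ending in NAc contributes one extra substituent residue (as the partid numbering in the question's XML suggests), and that a linkage (p+c) maps to from="C<c>", to="O<p>", from_replace="O<c>", which is a pattern inferred only from the one example XML. The data is the listing from the question joined onto one line; as posted it appears to end one } short, which the sketch tolerates.

```python
import re
from xml.dom.minidom import getDOMImplementation

# A token is either one sugar "[linkage][name]" or a single brace.
TOKEN = re.compile(r"\[([^\]]*)\]\[([^\]]*)\]|([{}])")
SUBSTITUENT = "n-acetyl"  # marker for a base-to-substituent link (assumption)


def parse_links(text):
    """Return (child_partid, parent_partid, linkage) triples in reading order."""
    links = []
    stack = []       # partids of sugars whose "{" is still open
    last = None      # partid of the most recently read sugar
    next_id = 1
    for match in TOKEN.finditer(text):
        linkage, name, brace = match.groups()
        if brace == "{":
            stack.append(last)       # sugars that follow are children of `last`
        elif brace == "}":
            if stack:
                stack.pop()          # tier closed, back to the previous parent
        else:
            base = next_id
            next_id += 1
            if stack:
                links.append((base, stack[-1], linkage))
            if name.endswith("NAc"):  # assumption: NAc adds one substituent residue
                links.append((next_id, base, SUBSTITUENT))
                next_id += 1
            last = base
    return links


def add_residue_links(doc, molecule, links):
    """Append one <residue_link> with its <atom_link> per triple."""
    for child, parent, linkage in links:
        link = doc.createElement("residue_link")
        link.setAttribute("from", str(child))
        link.setAttribute("to", str(parent))
        atom = doc.createElement("atom_link")
        if linkage == SUBSTITUENT:
            # Values copied from the question's XML for the N-acetyl group.
            atom.setAttribute("from", "N1H")
            atom.setAttribute("to", "C2")
            atom.setAttribute("to_replace", "O2")
        else:
            parent_c, child_c = re.match(r"\((\d+)\+(\d+)\)", linkage).groups()
            atom.setAttribute("from", "C" + child_c)
            atom.setAttribute("to", "O" + parent_c)
            atom.setAttribute("from_replace", "O" + child_c)
        atom.setAttribute("bond_order", "1")
        link.appendChild(atom)
        molecule.appendChild(link)


# The question's data on one line, exactly as posted.
text = (
    "[][b-D-GlcpNAc]{[(4+1)][b-D-GlcpNAc]{[(4+1)][b-D-Manp]"
    "{[(3+1)][a-D-Manp]{[(2+1)][a-D-Manp]{}}"
    "[(6+1)][a-D-Manp]{[(3+1)][a-D-Manp]{}[(6+1)][a-D-Manp]{}}}}"
)

doc = getDOMImplementation().createDocument(None, "GlydeII", None)
molecule = doc.createElement("molecule")
molecule.setAttribute("subtype", "glycan")
molecule.setAttribute("id", "From_GlycoCT_Translation")
doc.documentElement.appendChild(molecule)

links = parse_links(text)
add_residue_links(doc, molecule, links)
print(doc.toprettyxml(indent="    "))
```

The residue elements themselves are left out here since you already have those; they would be appended to the same molecule element before the links. The partid numbering in parse_links has to match whatever numbering you use when writing the residues.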
