Best way to transform a custom XML-like syntax

2023-03-20 10:56

Using Python.

So basically I have an XML-like tag syntax, but the tags don't have attributes: <a> is allowed, but <a value='t'> is not. Tags close regularly with </a>.

Here is my question. I have something that looks like this:

<al>
1. test
2. test2
 test with new line
3.  test3
<al>
    1. test 4
    <al>
        2. test 5
        3. test 6
        4. test 7
    </al>
</al>
4. test 8
</al>

And I want to transform it into:

<al>
<li>test</li>
<li> test2</li>
<li> test with new line</li>
<li>  test3
<al>
    <li> test 4 </li>
    <al>
        <li> test 5</li>
        <li> test 6</li>
        <li> test 7</li>
    </al>
    </li>
</al>
</li>
<li> test 8</li>
</al>

I'm not really looking for a completed solution but rather a push in the right direction. I am just wondering how the folks here would approach the problem. Solely regex? Writing a full custom parser for the attribute-less tag syntax? Hacking up existing XML parsers? Something else?

Thanks in advance

I'd recommend starting with the following:

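from xml.dom.minidom import parse, parseString

xml = parse(...)  # or parseString(...) if the markup is already in a string
l = xml.getElementsByTagName('al')  # every <al> element, at any nesting depth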

then traverse all elements in l, examining their text subnodes (as well as <al> nodes recursively).

You may start playing with this right away in the Python console.

It is easy to remove text nodes, then split text chunks with chunk.split('\n') and add <li> nodes back, as you need.

After modifying all the <al> nodes you may just call xml.toxml() to get the resulting xml as text.

Note that the element objects you get from this are linked back to the original xml document object, so do not delete the xml object in the process.

I personally consider this approach more straightforward and easier to debug than wrestling with multiline regexps.
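To make the steps above concrete, here is a minimal sketch of the minidom approach. The helper names (listify, transform, NUMBERING) are made up for illustration, and it assumes that every non-empty text line becomes one <li>, that the leading "1." numbering should be dropped, and that a nested <al> belongs inside the <li> that precedes it. It also assumes the input really is well-formed XML.

```python
import re
from xml.dom.minidom import parseString

# Hypothetical helper: matches a leading "1." style prefix on a line.
NUMBERING = re.compile(r"^\s*\d+\.\s*")

def listify(al, doc):
    """Replace the text children of one <al> element with <li> elements."""
    last_li = None
    for child in list(al.childNodes):  # copy, because we mutate while looping
        if child.nodeType == child.TEXT_NODE:
            for line in child.data.split("\n"):
                if not line.strip():
                    continue  # skip blank lines and pure indentation
                last_li = doc.createElement("li")
                text = NUMBERING.sub("", line).strip()
                last_li.appendChild(doc.createTextNode(text))
                al.insertBefore(last_li, child)
            al.removeChild(child)
        elif (child.nodeType == child.ELEMENT_NODE
              and child.tagName == "al" and last_li is not None):
            # appendChild moves the nested list into the preceding item
            last_li.appendChild(child)

def transform(text):
    doc = parseString(text)
    # The returned list also contains nested <al> elements.
    for al in doc.getElementsByTagName("al"):
        listify(al, doc)
    return doc.documentElement.toxml()

print(transform("<al>\n1. a\n<al>\n2. b\n</al>\n3. c\n</al>"))
```

The output is not pretty-printed, and indentation from the input is discarded, so treat this as a starting point rather than a finished converter.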

The way you've described your syntax, it is "XML without attributes". If that's so, it's still XML, so you can use XML tools such as XSLT and XQuery.

If you allow things that aren't allowed in XML, on the other hand, my approach would be to write a parser that handles your non-XML format and delivers XML-compatible SAX events. Then you'll be able to use any XML technology just by plugging in your parser in place of the regular XML parser.
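As a rough illustration of that second idea, the sketch below tokenizes attribute-less tags with a regex and reports them to a standard SAX ContentHandler. The names emit_sax_events, to_xml and TOKEN are invented for this example. Here the handler is the standard library's XMLGenerator, which simply writes the events back out as XML, but any other ContentHandler could be plugged in. The sketch does not check that tags are balanced.

```python
import io
import re
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Hypothetical tokenizer: an opening or closing tag without attributes.
TOKEN = re.compile(r"<(/?)(\w+)>")

def emit_sax_events(text, handler):
    """Report the tags and text found in `text` to a SAX ContentHandler."""
    handler.startDocument()
    pos = 0
    for match in TOKEN.finditer(text):
        if match.start() > pos:
            handler.characters(text[pos:match.start()])
        closing, name = match.groups()
        if closing:
            handler.endElement(name)
        else:
            handler.startElement(name, AttributesImpl({}))  # never any attributes
        pos = match.end()
    if pos < len(text):
        handler.characters(text[pos:])
    handler.endDocument()

def to_xml(text):
    out = io.StringIO()
    emit_sax_events(text, XMLGenerator(out, encoding="utf-8"))
    return out.getvalue()

# A bare "<" and "&" are not valid XML, but they come out escaped.
print(to_xml("<al>1 < 2 & 3</al>"))
```

Because anything that is not recognised as a tag is passed on as character data, input that a real XML parser would reject still ends up as well-formed XML on the other side.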

It depends on what exactly you want to do with it. If it is a one-off script, the following suffices:

cat in.txt | perl -pe 'if(!/<\/?al>/){s#^(\s*)([0-9]+\.)?(.*)$#$1<li>$3</li>#}'

It works, though I wouldn't call it very robust ;) For a one-off it's fine.
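Since the question mentions Python, here is a rough Python equivalent of the one-liner above. The names wrap_lines, TAG and ITEM are made up for this sketch. Like the Perl version it works line by line and only wraps each non-tag line in <li>...</li>, so it does not move a nested <al> inside the preceding <li>. One small deviation: blank lines are passed through unchanged instead of being wrapped.

```python
import re

TAG = re.compile(r"</?al>")                       # lines to leave alone
ITEM = re.compile(r"^(\s*)(?:[0-9]+\.)?(.*)$")    # indent, optional "1.", rest

def wrap_lines(text):
    out = []
    for line in text.split("\n"):
        if TAG.search(line) or not line.strip():
            out.append(line)                      # tag line or blank line
        else:
            indent, rest = ITEM.match(line).groups()
            out.append(f"{indent}<li>{rest}</li>")
    return "\n".join(out)

print(wrap_lines("<al>\n1. test\n  2. x\n</al>"))
```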

I am just wondering how the folks here would approach the problem.

I would go for using a parser.

My reasoning is that the operation you are trying to perform isn't merely a syntactic or lexical substitution. It's much more of a grammar transformation, which implies understanding the structure of your document.

In your example, you are not simply enclosing each line between <li> and </li>; you are also recursively enclosing blocks of the document that span several lines, if these represent an "item".

Maybe you could put together a regex capable of capturing the interpretative logic and the recursive nature of the problem, but doing that would be like digging a trench with a teaspoon: you could do it, but using a spade (a parser) is a much more logical choice.
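To give an idea of how small such a "spade" can be, here is a sketch of a hand-written recursive parser. It is only an illustration under a few assumptions: <al> and </al> each sit on their own line, every other non-empty line is one item, the "1." numbering is dropped, and a nested list belongs to the item before it. The names parse_list, render and transform are made up, and the text is not XML-escaped.

```python
import re

NUMBERING = re.compile(r"^\d+\.\s*")

def parse_list(lines, pos):
    """Parse the lines following an opening <al>; return (items, next_pos)."""
    items = []  # each item: {"text": str, "children": [nested item lists]}
    while pos < len(lines):
        line = lines[pos].strip()
        pos += 1
        if line == "</al>":
            return items, pos
        if line == "<al>":
            nested, pos = parse_list(lines, pos)   # recurse into the sub-list
            if not items:                          # list starts with a sub-list
                items.append({"text": "", "children": []})
            items[-1]["children"].append(nested)
        elif line:
            items.append({"text": NUMBERING.sub("", line), "children": []})
    raise ValueError("missing </al>")

def render(items):
    parts = ["<al>"]
    for item in items:
        nested = "".join(render(child) for child in item["children"])
        parts.append("<li>" + item["text"] + nested + "</li>")
    parts.append("</al>")
    return "".join(parts)

def transform(text):
    lines = text.strip().split("\n")
    if lines[0].strip() != "<al>":
        raise ValueError("expected <al> on the first line")
    items, _ = parse_list(lines, 1)
    return render(items)

print(transform("<al>\n1. a\n<al>\n2. b\n</al>\n3. c\n</al>"))
```

The recursion in parse_list mirrors the nesting of the document, which is exactly the part that is awkward to express with a regex.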

An additional reason to use a parser is the "real world". Regexes are true "grammar nazis": one glitch in your markup and they won't work. Parser libraries, on the other hand, are generally "flexible" (they treat different spellings such as <a></a> and <a/>, or HTML's <br> and XHTML's <br/>, uniformly), and some, like BeautifulSoup, are even "forgiving", meaning that they will try to guess (with a surprisingly high level of accuracy) what the document's author wanted to code, even if the document itself fails validation.

Also, a parser-based solution is much more maintainable than a regex-based one. A small change in your document structure might require radical changes to your regex (which by nature tends to become obscure to its very own author after 72 hours or so).

Finally, because you are using Python, where readability counts, a parser-based solution could result in much more Pythonic code than a very complex, long, or obscure regex.

HTH!
