Best way to transform a custom XML-like syntax

Using Python.

So basically I have an XML-like tag syntax, but the tags don't have attributes: <a> is valid, but <a value='t'> is not. Tags close normally with </a>.

Here is my question. I have something that looks like this:

<al>
1. test
2. test2
 test with new line
3.  test3
<al>
    1. test 4
    <al>
        2. test 5
        3. test 6
        4. test 7
    </al>
</al>
4. test 8
</al>

And I want to transform it into:

<al>
<li>test</li>
<li> test2</li>
<li> test with new line</li>
<li>  test3
<al>
    <li> test 4
    <al>
        <li> test 5</li>
        <li> test 6</li>
        <li> test 7</li>
    </al>
    </li>
</al>
</li>
<li> test 8</li>
</al>

I'm not really looking for a completed solution but rather a push in the right direction. I am just wondering how the folks here would approach the problem. Regex only? Writing a full custom parser for the attribute-less tag syntax? Hacking up an existing XML parser? Something else?

Thanks in advance


I'd recommend starting with the following:

from xml.dom.minidom import parse, parseString

xml = parse(...)
l = xml.getElementsByTagName('al')

Then traverse all the elements in l, examining their text subnodes (and recursing into nested <al> nodes).

You may start playing with this right away in the Python console.

It is easy to remove the text nodes, split each text chunk with chunk.split('\n'), and add <li> nodes back as needed.

After modifying all the <al> nodes you may just call xml.toxml() to get the resulting xml as text.

Note that the element objects you get from this are linked back to the original xml document object, so do not delete the xml object in the process.

I personally consider this approach more straightforward and easier to debug than wrangling multiline regexps.
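
A minimal sketch of the steps above, assuming the input is well-formed XML and that items are numbered in the "1. " style shown in the question. The names wrap_items and ITEM_PREFIX are made up for illustration. It wraps every non-empty text line in <li>, but it does not move a nested <al> inside the preceding <li>, so it only covers part of the desired output:

```python
import re
from xml.dom.minidom import parseString

# Assumption: items start with a number and a dot, e.g. "1. test".
ITEM_PREFIX = re.compile(r"^\s*\d+\.\s*")

def wrap_items(source):
    """Replace the text lines inside every <al> with <li> elements."""
    doc = parseString(source)
    for al in doc.getElementsByTagName("al"):
        # Copy childNodes first: the tree is modified while we loop.
        for node in list(al.childNodes):
            if node.nodeType != node.TEXT_NODE:
                continue  # nested <al> elements are handled by the outer loop
            for line in node.data.split("\n"):
                if not line.strip():
                    continue  # skip blank lines and pure indentation
                li = doc.createElement("li")
                li.appendChild(doc.createTextNode(ITEM_PREFIX.sub("", line)))
                al.insertBefore(li, node)
            al.removeChild(node)
    return doc.documentElement.toxml()

print(wrap_items("<al>\n1. test\n2. test2\n<al>\n1. test 4\n</al>\n3. test 8\n</al>"))
```

Lines without a number (such as the " test with new line" continuation) become their own <li> here, which matches the output requested in the question.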


The way you've described your syntax, it is "XML without attributes". If that's so, it's still XML, so you can use XML tools such as XSLT and XQuery.

If you allow things that aren't allowed in XML, on the other hand, my approach would be to write a parser that handles your non-XML format and delivers XML-compatible SAX events. Then you'll be able to use any XML technology just by plugging in your parser in place of the regular XML parser.
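
As a rough illustration of that second approach, here is a sketch of a tiny tokenizer that replays the custom format as SAX events on any ContentHandler. The function names feed_events and to_xml and the TAG pattern are assumptions made for this example; the sketch does no well-formedness checking (mismatched tags are passed through as-is). XMLGenerator from the standard library is used as the receiving handler, so stray characters such as a bare "<" in the text come out escaped:

```python
import io
import re
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Assumption: a tag is "<name>" or "</name>" with no attributes or whitespace.
TAG = re.compile(r"<(/?)(\w+)>")
NO_ATTRS = AttributesImpl({})

def feed_events(source, handler):
    """Tokenise `source` and replay it as SAX events on `handler`."""
    handler.startDocument()
    pos = 0
    for match in TAG.finditer(source):
        if match.start() > pos:
            # Everything between two tags is plain character data.
            handler.characters(source[pos:match.start()])
        closing, name = match.groups()
        if closing:
            handler.endElement(name)
        else:
            handler.startElement(name, NO_ATTRS)
        pos = match.end()
    if pos < len(source):
        handler.characters(source[pos:])
    handler.endDocument()

def to_xml(source):
    """Serialise the event stream back to real XML text."""
    out = io.StringIO()
    feed_events(source, XMLGenerator(out, encoding="utf-8"))
    return out.getvalue()

print(to_xml("<al>1 < 2 <b>bold</b></al>"))
```

Any other ContentHandler (for example one that builds a DOM or drives a transformation) could be passed to feed_events in place of XMLGenerator.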


It depends on what exactly you want to do with it. If it is a one-off script, the following suffices:

cat in.txt | perl -pe 'if(!/<\/?al>/){s#^(\s*)([0-9]+\.)?(.*)$#$1<li>$3</li>#}'

And it works. But I wouldn't say it's very robust ;) But if it's for a one-off it's fine.
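
Since the question asks for Python, here is a rough Python counterpart of the one-liner above. It is a sketch, not an exact port: wrap_lines is a made-up name, and skipping blank lines is an assumption that the Perl version does not make. Like the original, it works line by line and knows nothing about nesting:

```python
import re

TAG_LINE = re.compile(r"</?al>")
# Leading indentation, an optional "12." style number, then the item text.
ITEM = re.compile(r"^(\s*)(?:\d+\.)?(.*)$")

def wrap_lines(text):
    """Wrap every non-tag, non-blank line in <li>...</li>, keeping indentation."""
    out = []
    for line in text.splitlines():
        if line.strip() and not TAG_LINE.search(line):
            indent, body = ITEM.match(line).groups()
            line = f"{indent}<li>{body}</li>"
        out.append(line)
    return "\n".join(out)

print(wrap_lines("<al>\n1. test\n    2. test 5\n</al>"))
```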


I am just wondering how the folks here would approach the problem.

I would go for using a parser.

My reasoning is that the operation you are trying to perform isn't merely a syntactic or lexical substitution. It's much more of a grammar transformation, which implies understanding the structure of your document.

In your example, you are not simply enclosing each line between <li> and </li>; you are also recursively enclosing blocks of the document that span several lines when they belong to a single "item".

Maybe you could put together a regex capable of capturing the interpretative logic and the recursive nature of the problem, but doing that would be like digging a trench with a teaspoon: you could do it, but using a spade (a parser) is a much more logical choice.
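
To make the idea concrete, here is a minimal sketch of a hand-written recursive parser, assuming each <al> and </al> tag sits on its own line as in your example. The name parse_list and the NUMBER pattern are invented for illustration. A nested <al> is attached to the <li> that precedes it, which is the multi-line "item" logic a per-line substitution cannot express:

```python
import re
from xml.etree import ElementTree as ET

# Assumption: items are numbered "1. ", "2. ", ... as in the question.
NUMBER = re.compile(r"^\d+\.\s*")

def parse_list(lines, pos=0):
    """Parse the <al> that opens at lines[pos]; return (element, next_pos)."""
    if lines[pos].strip() != "<al>":
        raise ValueError("expected <al>")
    al = ET.Element("al")
    pos += 1
    last_li = None
    while pos < len(lines):
        line = lines[pos].strip()
        if line == "</al>":
            return al, pos + 1
        if line == "<al>":
            child, pos = parse_list(lines, pos)
            # A nested list belongs to the item before it, if there is one.
            parent = last_li if last_li is not None else al
            parent.append(child)
            continue
        if line:
            last_li = ET.SubElement(al, "li")
            last_li.text = NUMBER.sub("", line)
        pos += 1
    raise ValueError("missing </al>")

source = "<al>\n1. a\n2. b\n<al>\n1. c\n</al>\n3. d\n</al>"
root, _ = parse_list(source.splitlines())
print(ET.tostring(root, encoding="unicode"))
```

Unnumbered continuation lines become separate <li> elements in this sketch; merging them into the previous item would be a small change to the last branch.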

An additional reason to use a parser is the "real world". Regexes are true "grammar nazis": one glitch in your markup and they won't work. On the other hand, all parser libraries are "flexible" (they treat different spellings such as <a></a> and <a/>, or HTML's <br> and XHTML's <br/>, uniformly) and some, like beautifulsoup, are even "forgiving", meaning that they will try to guess (with a surprisingly high level of accuracy) what the document's author wanted to code, even if the document itself fails validation.

Also, a parser-based solution is much more maintainable than a regex-based one. A small change in your document structure might require radical changes to your regex (which by nature tends to become obscure to its very own author after 72 hours or so).

Finally, because you are using Python, where readability counts, a parser-based solution could result in much more Pythonic code than a very complex, long, or obscure regex.

HTH!
