Parsing problematic XML in QueryPath (dots in element names)
I am trying to parse a NewsML document (http://www.iptc.org/std/NewsML-G2/2.7/examples/LISTING2_NewsML-G2_Complete.xml) with QueryPath, but I have trouble with the dots in some element names, like <body.head>.
In some Firefox QueryPath plugins I am able to escape the dot with a backslash, but in the PHP PEAR library this does not work.
Any ideas?
(I am looking for a solution within QueryPath, not for workarounds.)
In the past, I've used the Tidy PHP extension (http://us3.php.net/manual/en/book.tidy.php) to clean up HTML/XML before passing it into QueryPath.
The XML you referenced above is pretty clean, and also pretty small.
If the only issue is dots in element names, preprocessing with a regular expression would probably work, too, and it would be the fastest solution. I'm guessing you could do preg_replace('/(<\/?)body\./', '$1body-', $xml) and have it fixed. (That would replace body.content with body-content and so on, in both opening and closing tags. preg_replace replaces every match by default, so no global modifier is needed.)
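A slightly more general form of the regular-expression preprocessing suggested above might look like the sketch below. It is not a QueryPath feature: replaceDotsInTagNames is a hypothetical helper name, and the sample XML is made up for illustration. It assumes the dotted names carry no namespace prefix and that no "<name.with.dots" text appears inside comments or CDATA sections, where it would be rewritten as well.

```php
<?php
// Hypothetical helper (not part of QueryPath): rewrite dotted element names
// such as <body.head> to <body-head> so they are easier to use in selectors.
function replaceDotsInTagNames(string $xml): string
{
    // Match "<" or "</" followed by a tag name made of dot-separated parts,
    // then swap every dot in that name for a hyphen.
    return preg_replace_callback(
        '/(<\/?)([A-Za-z_][\w-]*(?:\.[\w-]+)+)/',
        function (array $m): string {
            return $m[1] . str_replace('.', '-', $m[2]);
        },
        $xml
    );
}

// Made-up sample input, loosely modeled on the element names in the question.
$xml = '<body><body.head><hedline>Title</hedline></body.head>'
     . '<body.content><p>Text</p></body.content></body>';

$clean = replaceDotsInTagNames($xml);
echo $clean, "\n";
```

The cleaned string could then presumably be handed to QueryPath's qp() and queried with selectors such as body-head. Keep in mind that the element names in the parsed document no longer match the original NewsML, so anything written back out would need the reverse substitution.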