Manipulating well-formed xml (in any language running under linux)
I have well-formed xml (open-tags are closed, etc), but there's no dtd, namespaces are not always correct, and there are random entities.
I found an error in some of my xml files and want to fix it automatically. Essentially the xml file looks like this:
<foo>
<bar> hi </bar>
<!-- ... -->
<math><sometag><another>bar</another></sometag></math>
<!-- ... -->
</foo>
I want to change this to
<foo>
<bar> hi </bar>
<!-- ... -->
<m:math><m:sometag><m:another>bar</m:another></m:sometag></m:math>
<!-- ... -->
</foo>
I looked at Python's ElementTree, but according to diveintopython it will not cope with xml that cannot be validated? Also, it is important that nothing is changed except the prefixing with m:.
Since I'm writing a bunch of shell scripts to fix files, I don't really care which language is used, though my current weapon of choice is Python.
Clarifications:
- the xml does pass when executing xmllint on it
- I really want an xml solution, because parsing xml using regexes is way too flaky
- I don't know the names of the tags which can be between <math> and </math>
- no modification should be made to the document except the prefixing of the aforementioned tags with m:
In Perl you could use XML::Twig, for example like this:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
XML::Twig->new( twig_roots => { math => \&add_prefix },
twig_print_outside_roots => 1,
)
->parse( \*DATA);
sub add_prefix
{ my( $t, $math)= @_;
foreach my $m ( $math, $math->descendants( '#ELT'))
{ $m->set_tag( "m:" . $m->tag); }
$t->flush;
}
__DATA__
<foo>
<bar> hi </bar>
<!-- ... -->
<math><sometag><another>bar</another></sometag></math>
<!-- ... -->
</foo>
A one-liner in Perl ok?
$ perl -lne'm!<math>.*</math>! and s!<(/)?([^>]+)>!<$1m:$2>!gm;print' 5351382.txt
<foo>
<bar> hi </bar>
<!-- ... -->
<m:math><m:sometag><m:another>bar</m:another></m:sometag></m:math>
<!-- ... -->
</foo>
You shouldn't really parse XML this way... but if the above is sufficient for you... ;)
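One way to judge whether the result is sufficient is to check the question's last requirement mechanically: strip the prefix from every tag in the fixed file and compare with the original. A minimal Python sketch follows; the function name only_prefix_added and the shortened sample strings are made up for illustration. It only confirms that nothing except the prefix changed, not that the right tags were prefixed.

```python
import re


def only_prefix_added(original, fixed, prefix="m"):
    """Return True if removing `prefix:` from every tag in `fixed` gives back `original`."""
    # Matches "<m:" and "</m:" and keeps the optional slash.
    pattern = r"<(/?)" + re.escape(prefix) + ":"
    return re.sub(pattern, r"<\1", fixed) == original


# Hypothetical before/after strings, shortened from the example in the question.
before = "<foo>\n<bar> hi </bar>\n<math><sometag>bar</sometag></math>\n</foo>"
after = "<foo>\n<bar> hi </bar>\n<m:math><m:sometag>bar</m:sometag></m:math>\n</foo>"

print(only_prefix_added(before, after))
```

Running this over each original/fixed pair would flag any file where whitespace, comments or entities were touched along the way.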
In Ruby, using Nokogiri to massage the XML:
xml = <<EOT
<foo>
<bar> hi </bar>
<!-- ... -->
<math><sometag><another>bar</another></sometag></math>
<!-- ... -->
</foo>
EOT
NAMESPACE = %w[m http://host.com/m]
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse(xml)
ns = doc.at('foo').add_namespace_definition(*NAMESPACE)
doc.xpath('foo/math | foo/math//*').each { |n| n.namespace = ns }
puts doc.to_xml
The output looks like:
>> <foo xmlns:m="http://host.com/m">
>> <bar> hi </bar>
>> <!-- ... -->
>> <m:math><m:sometag><m:another>bar</m:another></m:sometag></m:math>
>> <!-- ... -->
>> </foo>
If the namespace can't be added to <foo>, then you can munge the tag names directly without messing with namespaces:
xml = <<EOT
<foo>
<bar> hi </bar>
<!-- ... -->
<math><sometag><another>bar</another></sometag></math>
<!-- ... -->
</foo>
EOT
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse(xml)
# Rename the elements directly; no namespace is declared, the prefix is just part of the name.
doc.xpath('foo/math | foo/math//*').each { |n| n.name = "m:#{n.name}" }
puts doc.to_xml
# >> <foo>
# >> <bar> hi </bar>
# >> <!-- ... -->
# >> <m:math><m:sometag><m:another>bar</m:another></m:sometag></m:math>
# >> <!-- ... -->
# >> </foo>
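Your best bet will probably be to find a non-validating XSLT processor and pass it something like the stylesheet below. The namespace URI is a placeholder, the first template copies everything else unchanged, and the output will carry an xmlns:m declaration on the renamed element:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:m="http://host.com/m">
<xsl:template match="@*|node()">
<xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
</xsl:template>
<xsl:template match="math|math//*">
<xsl:element name="m:{local-name()}">
<xsl:apply-templates select="@*|node()"/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>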
Perhaps BeautifulSoup will serve you better than Python's built-in stuff. It's mainly designed for HTML, but can do XML as well, although...
The BeautifulSoup class is full of web-browser-like heuristics for divining the intent of HTML authors. But XML doesn't have a fixed tag set, so those heuristics don't apply. So BeautifulSoup doesn't do XML very well.
It might not be perfect, but probably fares better on unspecified or invalid XML than a strict parser does. Another point in its favour is that it gives you Unicode, dammit.
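For comparison, here is a rough sketch of the same renaming using only Python's standard library (xml.dom.minidom), which does not validate and keeps comments. The function name prefix_math is made up for illustration. Two caveats, both assumptions about your files: minidom sits on top of expat, so entities that are not declared anywhere will make the parse fail, and the serializer may normalise small details (attribute quoting, empty elements), so "nothing else changes" is not guaranteed on real data.

```python
from xml.dom import minidom

SAMPLE = """<foo>
<bar> hi </bar>
<!-- ... -->
<math><sometag><another>bar</another></sometag></math>
<!-- ... -->
</foo>"""


def prefix_math(xml_text, prefix="m"):
    """Return xml_text with every <math> element and its descendants prefixed."""
    doc = minidom.parseString(xml_text)
    for math in doc.getElementsByTagName("math"):
        # getElementsByTagName("*") returns all descendant elements of `math`.
        for node in [math] + list(math.getElementsByTagName("*")):
            # Skip elements already renamed (e.g. a <math> nested in another <math>).
            if not node.tagName.startswith(prefix + ":"):
                node.tagName = node.nodeName = prefix + ":" + node.tagName
    # Serialize the root element only, so no XML declaration is prepended.
    return doc.documentElement.toxml()


print(prefix_math(SAMPLE))
```

This only rewrites the tag names; it does not declare the m namespace anywhere, which matches the output asked for in the question but leaves the prefix unbound.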