How do I get Python XML to stop having wasted child nodes?
I have a simple XML document I'm trying to read in with Python DOM (see below):
XML File:
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
<Header>
<Reserved>2</Reserved>
<CPU>1</CPU>
<Flag>1</Flag>
<VQI>12</VQI>
<Group_ID>16</Group_ID>
<DI>2</DI>
<DE>1</DE>
<ACOSS>5</ACOSS>
<RGH>8</RGH>
</Header>
</HeaderLookup>
Python Code:
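from xml.dom import minidom

# Parse the file and grab the root element; the with-block closes the file.
with open("test.xml") as xml_file:
    xmlroot = minidom.parse(xml_file).documentElement

# childNodes includes every node under <Header>, text nodes as well as elements.
for item in xmlroot.getElementsByTagName("Header")[0].childNodes:
    print(item)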
Result:
<DOM Text node "u'\n\t\t'">
<DOM Element: Reserved at 0x28d2828>
<DOM Text node "u'\n\t\t'">
<DOM Element: CPU at 0x28d28c8>
<DOM Text node "u'\n\t\t'">
<DOM Element: Flag at 0x28d2968>
<DOM Text node "u'\n\t\t'">
<DOM Element: VQI at 0x28d2a08>
<DOM Text node "u'\n\t\t'">
<DOM Element: Group_ID at 0x28d2ad0>
<DOM Text node "u'\n\t\t'">
<DOM Element: DI at 0x28d2b70>
<DOM Text node "u'\n\t\t'">
<DOM Element: DE at 0x28d2c10>
<DOM Text node "u'\n\t\t'">
<DOM Element: ACOSS at 0x28d2cb0>
<DOM Text node "u'\n\t\t'">
<DOM Element: RGH at 0x28d2d50>
<DOM Text node "u'\n\t'">
The result should be 9 child nodes (Reserved, CPU, Flag, VQI, Group_ID, DI, DE, ACOSS, and RGH), but it returns a list of 19 nodes, 10 of which are whitespace. Why is whitespace even considered a node in the first place? Is there a way to get the XML parser to not include whitespace nodes?
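Whitespace is significant in XML: without a DTD or schema, the parser cannot tell whether the whitespace between elements is meaningful, so the DOM reports it as text nodes. Check out ElementTree, which has a different API for processing XML than the DOM and yields only child elements when you iterate over an element.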
Example
from xml.etree import ElementTree as et
data = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
<Header>
<Reserved>2</Reserved>
<CPU>1</CPU>
<Flag>1</Flag>
<VQI>12</VQI>
<Group_ID>16</Group_ID>
<DI>2</DI>
<DE>1</DE>
<ACOSS>5</ACOSS>
<RGH>8</RGH>
</Header>
</HeaderLookup>
'''
tree = et.fromstring(data)
# Iterating over an element yields only its child elements, not text nodes.
for n in tree.find('Header'):
    print('%s = %s' % (n.tag, n.text))
Output
Reserved = 2
CPU = 1
Flag = 1
VQI = 12
Group_ID = 16
DI = 2
DE = 1
ACOSS = 5
RGH = 8
Example (extending previous code)
The whitespace is still present, but it is in the .text and .tail attributes. tail is the text that follows an element (between the end of one element and the start of the next), while text is the text between an element's start tag and its first child, or its end tag if it has no children.
def dump(e):
    # Recursively print each element along with its text and tail.
    print('<%s>' % e.tag)
    print('text = %r' % (e.text,))
    for n in e:
        dump(n)
    print('</%s>' % e.tag)
    print('tail = %r' % (e.tail,))

dump(tree)
Output
<HeaderLookup>
text = '\n '
<Header>
text = '\n '
<Reserved>
text = '2'
</Reserved>
tail = '\n '
<CPU>
text = '1'
</CPU>
tail = '\n '
<Flag>
text = '1'
</Flag>
tail = '\n '
<VQI>
text = '12'
</VQI>
tail = '\n '
<Group_ID>
text = '16'
</Group_ID>
tail = '\n '
<DI>
text = '2'
</DI>
tail = '\n '
<DE>
text = '1'
</DE>
tail = '\n '
<ACOSS>
text = '5'
</ACOSS>
tail = '\n '
<RGH>
text = '8'
</RGH>
tail = '\n '
</Header>
tail = '\n'
</HeaderLookup>
tail = None
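If you would rather stay with minidom, one option is to filter the child list down to element nodes yourself. The following is only a minimal sketch, not a parser setting: element_children is a hypothetical helper name introduced here, and the XML is inlined as a string (with the declaration omitted) so the snippet runs on its own.

```python
from xml.dom import minidom, Node

# Same structure as the question's file, inlined so the sketch is self-contained.
XML = """<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
"""


def element_children(node):
    """Hypothetical helper: return only the element children of a DOM node."""
    return [child for child in node.childNodes
            if child.nodeType == Node.ELEMENT_NODE]


root = minidom.parseString(XML).documentElement
header = root.getElementsByTagName("Header")[0]

for child in element_children(header):
    # firstChild is the text node that holds the element's value.
    print("%s = %s" % (child.tagName, child.firstChild.data))
```

The whitespace text nodes are still in the tree (header.childNodes still has 19 entries); the helper just skips them, along with any comments or other non-element nodes.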