How do I get Python's XML parser to stop returning useless child nodes?

I have a simple XML document I'm trying to read in with Python DOM (see below):

XML File:

<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>

Python Code:

from xml.dom import minidom

with open("test.xml") as xml_file:
    xmlroot = minidom.parse(xml_file).documentElement

for item in xmlroot.getElementsByTagName("Header")[0].childNodes:
    print(item)

Result:

<DOM Text node "u'\n\t\t'">
<DOM Element: Reserved at 0x28d2828>
<DOM Text node "u'\n\t\t'">
<DOM Element: CPU at 0x28d28c8>
<DOM Text node "u'\n\t\t'">
<DOM Element: Flag at 0x28d2968>
<DOM Text node "u'\n\t\t'">
<DOM Element: VQI at 0x28d2a08>
<DOM Text node "u'\n\t\t'">
<DOM Element: Group_ID at 0x28d2ad0>
<DOM Text node "u'\n\t\t'">
<DOM Element: DI at 0x28d2b70>
<DOM Text node "u'\n\t\t'">
<DOM Element: DE at 0x28d2c10>
<DOM Text node "u'\n\t\t'">
<DOM Element: ACOSS at 0x28d2cb0>
<DOM Text node "u'\n\t\t'">
<DOM Element: RGH at 0x28d2d50>
<DOM Text node "u'\n\t'">

I expected 9 child nodes (Reserved, CPU, Flag, VQI, Group_ID, DI, DE, ACOSS, and RGH), but I get a list of 19 nodes, 10 of which are whitespace. Why is whitespace even considered a node in the first place? Can anyone tell me if there's a way to get the XML parser to leave out whitespace nodes?


Whitespace is significant in XML, so the DOM keeps it as text nodes. Check out ElementTree, which has a different API for processing XML than the DOM.

Example

from xml.etree import ElementTree as et

data = '''\
<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
        <VQI>12</VQI>
        <Group_ID>16</Group_ID>
        <DI>2</DI>
        <DE>1</DE>
        <ACOSS>5</ACOSS>
        <RGH>8</RGH>
    </Header>
</HeaderLookup>
'''

tree = et.fromstring(data)
for n in tree.find('Header'):
    print(n.tag, '=', n.text)

Output

Reserved = 2
CPU = 1
Flag = 1
VQI = 12
Group_ID = 16
DI = 2
DE = 1
ACOSS = 5
RGH = 8
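
Since looping over an element only gives you its child elements, the header is easy to turn into a dictionary. A small sketch on a shortened copy of the XML (the names snippet, header and fields are only for illustration, and it assumes every value is an integer, as in your file):

```python
from xml.etree import ElementTree as et

snippet = """<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <VQI>12</VQI>
    </Header>
</HeaderLookup>"""

header = et.fromstring(snippet).find('Header')

# Iterating an Element yields child elements only; the indentation
# between them is not a child, so there is nothing to filter out.
fields = {child.tag: int(child.text) for child in header}

print(len(header), fields)
```

len(header) counts child elements too, so on your full file it should report 9 rather than 19.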

Example (extending previous code)

The whitespace is still present, but it is stored in the .text and .tail attributes instead of in separate nodes. tail is the text that follows an element's end tag (between the end of one element and the start of the next), while text is the text between an element's start tag and its first child or its end tag.

def dump(e):
    print('<%s>' % e.tag)
    print('text =', repr(e.text))
    for n in e:
        dump(n)
    print('</%s>' % e.tag)
    print('tail =', repr(e.tail))

dump(tree)

Output

<HeaderLookup>
text = '\n    '
<Header>
text = '\n        '
<Reserved>
text = '2'
</Reserved>
tail = '\n        '
<CPU>
text = '1'
</CPU>
tail = '\n        '
<Flag>
text = '1'
</Flag>
tail = '\n        '
<VQI>
text = '12'
</VQI>
tail = '\n        '
<Group_ID>
text = '16'
</Group_ID>
tail = '\n        '
<DI>
text = '2'
</DI>
tail = '\n        '
<DE>
text = '1'
</DE>
tail = '\n        '
<ACOSS>
text = '5'
</ACOSS>
tail = '\n        '
<RGH>
text = '8'
</RGH>
tail = '\n    '
</Header>
tail = '\n'
</HeaderLookup>
tail = None
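
If you would rather stay with minidom, as far as I know there is no parser switch for dropping whitespace, but every node has a nodeType, so you can skip the text nodes yourself. A minimal sketch on a shortened copy of your XML (element_children is a made-up helper name, not part of minidom):

```python
from xml.dom import minidom

XML = """<?xml version="1.0" encoding="utf-8"?>
<HeaderLookup>
    <Header>
        <Reserved>2</Reserved>
        <CPU>1</CPU>
        <Flag>1</Flag>
    </Header>
</HeaderLookup>
"""


def element_children(node):
    """Return only the element children of node (hypothetical helper).

    Text nodes, including whitespace-only ones, and comments are skipped.
    """
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


root = minidom.parseString(XML).documentElement
header = root.getElementsByTagName("Header")[0]

# childNodes still contains the whitespace text nodes; the helper drops them.
names = [child.tagName for child in element_children(header)]
print(len(header.childNodes), names)
```

This leaves the document untouched and only filters what you loop over, which is usually enough when the whitespace is just indentation.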