Processing RSS/RDF via xml.dom.minidom
I'm trying to process a Delicious RSS feed with Python. Here's a sample:
...
<item rdf:about="http://weblist.me/">
<title>WebList - The Place To Find The Best List On The Web</title>
<dc:date>2009-12-24T17:46:14Z</dc:date>
<link>http://weblist.me/</link>
...
</item>
<item rdf:about="http://thumboo.com/">
<title>Thumboo! Free Website Thumbnails and PHP Script to Generate Web Screenshots</title>
<dc:date>2006-10-24T18:11:32Z</dc:date>
<link>http://thumboo.com/</link>
...
The relevant code is:
def getText(nodelist):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

dom = xml.dom.minidom.parse(file)
items = dom.getElementsByTagName("item")
for i in items:
    title = i.getElementsByTagName("title")
    print getText(title)
I would think this would print out each title, but instead I basically get blank output. I'm sure I'm doing something stupid, but I have no idea what.
You are passing the title nodes themselves to getText. Those are element nodes, so their nodeType is never node.TEXT_NODE; the text lives in their child nodes. You have to loop over the children of each node instead in your getText method:
def getTextSingle(node):
    parts = [child.data for child in node.childNodes if child.nodeType == node.TEXT_NODE]
    return u"".join(parts)

def getText(nodelist):
    return u"".join(getTextSingle(node) for node in nodelist)
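To see why the original version prints nothing, it can help to inspect the node types directly. The following is a minimal Python 3 sketch, assuming a trimmed-down sample document (the SAMPLE string and its namespace declarations are made up for illustration; the real feed will contain more elements):

```python
import xml.dom.minidom

# Hypothetical sample, trimmed from the feed shown in the question.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="http://weblist.me/">
    <title>WebList - The Place To Find The Best List On The Web</title>
    <dc:date>2009-12-24T17:46:14Z</dc:date>
    <link>http://weblist.me/</link>
  </item>
</rdf:RDF>"""

dom = xml.dom.minidom.parseString(SAMPLE)
title = dom.getElementsByTagName("title")[0]

# The <title> node is an element, so a TEXT_NODE check on it never matches.
is_element = title.nodeType == title.ELEMENT_NODE

# The text sits one level down, among the element's children.
child_types = [child.nodeType for child in title.childNodes]
text = "".join(
    child.data for child in title.childNodes
    if child.nodeType == child.TEXT_NODE
)

print(is_element, child_types, text)
```

Here the element check comes out true while the text is only reachable through childNodes, which is the situation getTextSingle handles.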
Even better, call node.normalize() before calling getTextSingle, which ensures that consecutive children of type node.TEXT_NODE are merged into a single node.TEXT_NODE.
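As a rough illustration of what normalize() does, here is a small Python 3 sketch. The one-line document and the hand-appended text node are assumptions made purely for the demonstration; a freshly parsed feed may not contain adjacent text nodes at all:

```python
import xml.dom.minidom

# Tiny made-up document, just enough to have a <title> element.
dom = xml.dom.minidom.parseString("<item><title>Web</title></item>")
title = dom.getElementsByTagName("title")[0]

# Appending a second text node by hand leaves two adjacent TEXT_NODE children.
title.appendChild(dom.createTextNode("List"))
count_before = len(title.childNodes)

# normalize() merges adjacent text nodes in this subtree.
title.normalize()
count_after = len(title.childNodes)
merged_text = title.firstChild.data

print(count_before, count_after, merged_text)
```

Joining the text children gives the same string either way, so normalize() is mostly useful when later code expects a single text child per element.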