Processing RSS/RDF via xml.dom.minidom

2022-12-25 18:30 问答作者：

I'm trying to process a delicious rss feed via python. Here's a sample:

...
  <item rdf:about="http://weblist.me/">
    <title>WebList - The Place To Find The Best List On The Web</title>
    <dc:date>2009-12-24T17:46:14Z</dc:date>
    <link>http://weblist.me/</link>
    ...
  </item>
  <item rdf:about="http://thumboo.com/">
    <title>Thumboo! Free Website Thumbnails and PHP Script to Generate Web Screenshots</title>
    <dc:date>2006-10-24T18:11:32Z</dc:date>
    <link>http://thumboo.com/</link>
...

The relevant code is:

def getText(nodelist开发者_开发知识库):
    rc = ""
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc = rc + node.data
    return rc

dom = xml.dom.minidom.parse(file)
items = dom.getElementsByTagName("item")
for i in items:
    title = i.getElementsByTagName("title")
    print getText(title)

I would think this would print out each title, but instead I get basically get blank output. I'm sure I'm doing something stupid wrong, but no idea what?

You are passing the title nodes to getText, whose nodeTypes are not node.TEXT_NODE. You have to loop over all the children of the node instead in your getText method:

def getTextSingle(node):
    parts = [child.data for child in node.childNodes if child.nodeType == node.TEXT_NODE]
    return u"".join(parts)

def getText(nodelist):
    return u"".join(getTextSingle(node) for node in nodelist)

Even better, call node.normalize() before calling getTextSingle which ensures that consecutive children of type node.TEXT_NODE are merged into a single node.TEXT_NODE.

继续阅读：python rss

Processing RSS/RDF via xml.dom.minidom

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？