Python-generated RSS: outputting raw HTML?
I'm using PyRSS2Gen and I would like to publish raw HTML (specifically, a couple of images) with each item in my feed.
However, looking at the source it seems the constructor for RSSItem does not accept 'image' and all HTML is auto-escaped - is there any clever way I can get round this?
I found this post, but the code example doesn't seem to work.
I'm not attached to PyRSS2Gen if anyone has a better solution. Maybe I should just write my own RSS feed?
Thanks!
I learned from painful experience that PyRSS2Gen isn't the way to go for this. The problem is that PyRSS2Gen uses Python's SAX library, specifically xml.sax.saxutils.XMLGenerator, which escapes all characters that need escaping in XML, including angle brackets. So even if you extend PyRSS2Gen to add a tag, any HTML you put inside it will still be escaped.
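To see the escaping in isolation, here is a minimal sketch that drives xml.sax.saxutils.XMLGenerator directly, without PyRSS2Gen. The element name and image URL are just placeholders I picked for illustration.

```python
import io
from xml.sax.saxutils import XMLGenerator

# Write one element containing HTML through the SAX writer.
buffer = io.StringIO()
handler = XMLGenerator(buffer, "utf-8")
handler.startElement("description", {})
handler.characters('<img src="http://example.com/image.jpg"/>')
handler.endElement("description")

escaped = buffer.getvalue()
print(escaped)
# <description>&lt;img src="http://example.com/image.jpg"/&gt;</description>
```

The angle brackets come out as &lt; and &gt;, which is why a feed reader shows the markup as text instead of rendering the image.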
Typically, I've seen HTML in RSS (which is XML, not HTML) wrapped in a CDATA section. Python's SAX writer has no concept of CDATA, but minidom does. So what I did was drop PyRSS2Gen, add some extra lines of my own code, and use minidom to generate the XML.
You only need Document from minidom (from xml.dom.minidom import Document).
You build the document like this:
doc = Document()
rss = doc.createElement('rss')
rss.setAttribute('version', '2.0')
doc.appendChild(rss)
channel = doc.createElement('channel')
rss.appendChild(channel)
channelTitle = doc.createElement('title')
channel.appendChild(channelTitle)
etc., and then generate the xml (RSS) file when you're done:
with open('whitegrass.xml', 'w') as f:
    doc.writexml(f)
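The snippet above stops before the CDATA part, so here is a fuller sketch of the same minidom approach. The build_feed function, its arguments, and the sample URLs are hypothetical names for illustration; createTextNode and createCDATASection are the standard minidom calls.

```python
from xml.dom.minidom import Document


def build_feed(title, link, items):
    """Build an RSS 2.0 document; each item is a (title, link, html) tuple."""
    doc = Document()
    rss = doc.createElement("rss")
    rss.setAttribute("version", "2.0")
    doc.appendChild(rss)
    channel = doc.createElement("channel")
    rss.appendChild(channel)

    def add_text(parent, tag, text):
        # Plain text nodes are escaped by minidom when serialized.
        node = doc.createElement(tag)
        node.appendChild(doc.createTextNode(text))
        parent.appendChild(node)

    add_text(channel, "title", title)
    add_text(channel, "link", link)
    add_text(channel, "description", title)

    for item_title, item_link, html in items:
        item = doc.createElement("item")
        channel.appendChild(item)
        add_text(item, "title", item_title)
        add_text(item, "link", item_link)
        # The HTML goes into a CDATA section, so it is written out unescaped.
        description = doc.createElement("description")
        description.appendChild(doc.createCDATASection(html))
        item.appendChild(description)
    return doc


feed = build_feed(
    "Example Title",
    "http://example.com",
    [("Item Title", "http://example.com/1",
      '<p><img src="http://example.com/image.jpg"/></p>')],
)
xml_text = feed.toxml()
```

One thing to watch: minidom refuses to serialize a CDATA section whose data contains ]]>, so HTML with that sequence needs to be split up first.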
I wrote the blog post you linked. I copied the code from the gist and ran it under Kubuntu 11.10 after installing PyRSS2Gen, and it produced output without a problem. Take a look at the test.xml file; it should look like this:
<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
<channel>
<title>Example Title</title>
<link>http://example.com</link>
<description>Example RSS Output</description>
<pubDate>Thu, 27 Oct 2011 05:36:27 GMT</pubDate>
<lastBuildDate>Thu, 27 Oct 2011 05:36:27 GMT</lastBuildDate>
<generator>PyRSS2Gen-1.0.0</generator>
<docs>http://blogs.law.harvard.edu/tech/rss</docs>
<item>
<title>Item Title</title>
<link>http://example.com</link>
<media:thumbnail url="http://example.com/image.jpg"></media:thumbnail>
<description><![CDATA[<p><b>example</b>text<p><br/>
<p>Where are you going today?</p>
]]></description>
<guid>random_guid_x9129319</guid>
<pubDate>Thu, 27 Oct 2011 14:36:27 GMT</pubDate>
</item>
</channel>
</rss>
I'll try to explain below how that code works, for posterity's sake.
As ViennaMike said above, PyRSS2Gen uses the built-in SAX library, which automatically escapes the HTML. There are, however, ways to get around this. In the code fragment you mentioned, I overrode PyRSS2Gen's "RSSItem" so that when it reaches "description", it does not actually output anything. (This is the reason for the "NoOutput" class.)
Since description is not being output, we have to add a method to attach it to the output ourselves. Hence, the "publish_extensions" code (which outputs both the media_thumbnail and description tags).
I can see that it is somewhat confusing (since you don't need a media_thumbnail class) so I've gone ahead and re-written the class so there is no "Media Thumbnail" class to muck things up for you.
# This is insecure, and only here as a proof of concept. Your mileage may vary. Et cetera.
import PyRSS2Gen
import datetime


class NoOutput:
    # Placeholder object: PyRSS2Gen calls publish() on it and nothing is written.
    def __init__(self):
        pass

    def publish(self, handler):
        pass


class IPhoneRSS2(PyRSS2Gen.RSSItem):
    def __init__(self, **kwargs):
        PyRSS2Gen.RSSItem.__init__(self, **kwargs)

    def publish(self, handler):
        # Stash the real description, then swap in NoOutput. This disables the
        # PyRSS2Gen "automatic" output of the description, which would be escaped.
        self.do_not_autooutput_description = self.description
        self.description = NoOutput()
        PyRSS2Gen.RSSItem.publish(self, handler)

    def publish_extensions(self, handler):
        # Write the description ourselves, unescaped, inside a CDATA section.
        handler._out.write('<%s><![CDATA[%s]]></%s>' % ("description", self.do_not_autooutput_description, "description"))
# How to use:
rss = PyRSS2Gen.RSS2(
    title="Example Title",
    link="http://example.com",
    description="Example RSS Output",
    lastBuildDate=datetime.datetime.utcnow(),
    pubDate=datetime.datetime.utcnow(),
    items=[
        IPhoneRSS2(
            title="Item Title",
            description="""<p><b>example</b>text<p><br/>
<p>Where are you going today?</p>
""",
            link="http://example.com",
            guid="random_guid_x9129319",
            pubDate=datetime.datetime.now()),
    ]
)
rss.rss_attrs["xmlns:media"] = "http://search.yahoo.com/mrss/"
# The with block closes the file once the feed has been written.
with open("test.xml", "w") as f:
    rss.write_xml(f, "utf-8")
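One concrete reason to treat the raw write as unsafe (my reading, not necessarily the only one): if the description itself contains ]]>, it ends the CDATA section early and breaks the feed. A common workaround is to split that sequence across two CDATA sections. Here is a small sketch; wrap_cdata is a hypothetical helper name, not part of PyRSS2Gen.

```python
from xml.dom import minidom


def wrap_cdata(html):
    """Wrap html in CDATA, splitting any embedded ']]>' across two sections."""
    return "<![CDATA[" + html.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Round-trip check: parse the wrapped text and glue the CDATA pieces back together.
sample = "before ]]> <b>after</b>"
element = "<description>%s</description>" % wrap_cdata(sample)
parsed = minidom.parseString(element).documentElement
recovered = "".join(node.data for node in parsed.childNodes)
```

In publish_extensions you could then write wrap_cdata(self.do_not_autooutput_description) between the description tags instead of interpolating the string directly. This only keeps the XML well-formed; it does nothing to sanitize the HTML itself.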
You mention that you want to include an image in your feed; are you including the HTML for your image in the description tag, or is it elsewhere? If it is elsewhere, can you provide a sample RSS feed so I can make appropriate changes for your situation?
jbm's answer is good. One addition: Python 2.7.5 changed the sax library, so jbm's code needs a small modification:
def publish_extensions(self, handler):
    handler._write('<%s><![CDATA[%s]]></%s>' % ("description", self.do_not_autooutput_description, "description"))
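If the same script has to run on both older and newer Pythons, one option is to look up whichever private attribute the handler actually has. This is only a sketch: write_raw is a hypothetical helper, and both _write and _out are undocumented internals of XMLGenerator that could change again.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_raw(handler, text):
    """Write text to the handler's stream without escaping it."""
    write = getattr(handler, "_write", None)  # Python 2.7.5+ and Python 3
    if write is None:
        write = handler._out.write  # older Python 2 releases
    write(text)


# Demonstration against a bare XMLGenerator (PyRSS2Gen passes one to publish()).
buffer = io.StringIO()
handler = XMLGenerator(buffer, "utf-8")
handler.startElement("item", {})
write_raw(handler, "<description><![CDATA[<b>hi</b>]]></description>")
handler.endElement("item")
raw_output = buffer.getvalue()
```

Inside publish_extensions, the body would then become a single write_raw(handler, ...) call with the same formatted string as above.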