How do you recover from invalid tags in an RSS feed?
I am working on an RSS feed reader. Some feeds contain tags such as <i> and <b> that are invalid in RSS, and I get an exception when I parse them.
To demo the error, I posted sample code. Here is some info:
Exception message: Unexpected node type Element. ReadElementString method can only be called on elements with simple or empty content.
Exception: System.Xml.XmlException.
Raw XML: see the XML for this RSS feed: http://www.npr.org/rss/rss.php?id=1001. View the page source. The issue is on line 56 (an <a> tag in the RSS).
Exception comments: If you look at the raw RSS, there is an <a> tag in it. The RSS parser does not accept this, so it throws an exception. The error is in line 34 (Console.WriteLine(ex.Message);)
Is there a nice way to either process HTML tags in RSS feeds or ignore them?
Note: I added Microsoft's code that extends the XmlTextReader class. It is a means of bypassing invalid dates in RSS. Ignore that; I added it to the code to work around an unrelated bug from Microsoft.
Here is sample code that you can run to see the exception:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.ServiceModel.Syndication;
using System.Xml;
using System.Globalization;
using System.IO;

namespace RssTest
{
    class Program
    {
        static void Main(string[] args)
        {
            DoRSS();
        }

        public static void DoRSS()
        {
            string url = "https://west.thomson.com/about/feeds/west_prfeed.xml";
            var r = new MyXmlReader(url);

            // Parsing the feed is where the exception is thrown.
            SyndicationFeed feed = SyndicationFeed.Load(r);

            // Write the parsed feed back out as RSS 2.0.
            Rss20FeedFormatter rssFormatter = feed.GetRss20Formatter();
            XmlTextWriter rssWriter = new XmlTextWriter("rss.xml", Encoding.UTF8);
            rssWriter.Formatting = Formatting.Indented;
            rssFormatter.WriteTo(rssWriter);
            rssWriter.Close();

            foreach (var i in feed.Items)
            {
                Console.WriteLine(i.Summary.Text);
            }
        }
    }

    // From Microsoft: works around dates that SyndicationFeed cannot parse.
    public class MyXmlReader : XmlTextReader
    {
        private bool readingDate = false;
        const string CustomUtcDateTimeFormat = "ddd MMM dd HH:mm:ss Z yyyy"; // Wed Oct 07 08:00:07 GMT 2009

        public MyXmlReader(Stream s) : base(s) { }

        public MyXmlReader(string inputUri) : base(inputUri) { }

        public override void ReadStartElement()
        {
            // Flag date elements so ReadString can normalize their content.
            if (string.Equals(base.NamespaceURI, string.Empty, StringComparison.InvariantCultureIgnoreCase) &&
                (string.Equals(base.LocalName, "lastBuildDate", StringComparison.InvariantCultureIgnoreCase) ||
                 string.Equals(base.LocalName, "pubDate", StringComparison.InvariantCultureIgnoreCase)))
            {
                readingDate = true;
            }
            base.ReadStartElement();
        }

        public override void ReadEndElement()
        {
            if (readingDate)
            {
                readingDate = false;
            }
            base.ReadEndElement();
        }

        public override string ReadString()
        {
            if (readingDate)
            {
                // Re-emit the date in RFC 1123 ("R") format.
                string dateString = base.ReadString();
                DateTime dt;
                if (!DateTime.TryParse(dateString, out dt))
                    dt = DateTime.ParseExact(dateString, CustomUtcDateTimeFormat, CultureInfo.InvariantCulture);
                return dt.ToUniversalTime().ToString("R", CultureInfo.InvariantCulture);
            }
            else
            {
                return base.ReadString();
            }
        }
    }
}
Here is a solution worth examining:
http://www.eggheadcafe.com/tutorials/aspnet/9faa101f-0a1a-465f-a41a-3e52dd9f7526/everything-rss--atom-f.aspx
You can't, really. If the data is not valid XML, then it's not valid XML and the feed owner needs to fix it. Those tags need to be escaped, or else placed inside of a CDATA section.
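To make the escaping and CDATA options concrete, here is a minimal sketch, not a definitive fix. The class name DescriptionDemo, both helper names, and the sample <description> strings are made up for illustration. It uses XmlReader.ReadElementContentAsString on a single element instead of SyndicationFeed to keep the example small; that method also rejects elements that contain child elements. WrapChildrenInCData is an assumed client-side workaround for when the feed owner will not fix the feed: it re-serializes the child nodes and stores them as one CDATA node. It does not consider content that itself contains "]]>".

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class DescriptionDemo
{
    // Reads one element the way a parser expecting simple (text-only) content would.
    public static string ReadSimpleContent(string xml)
    {
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            return reader.ReadElementContentAsString();
        }
    }

    // Hypothetical repair step: if the element has child elements,
    // turn its children back into markup text and wrap that in CDATA.
    public static string WrapChildrenInCData(string xml)
    {
        XElement element = XElement.Parse(xml);
        if (element.HasElements)
        {
            string markup = string.Concat(
                element.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
            element.ReplaceNodes(new XCData(markup));
        }
        return element.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        // Illustrative inputs only; none of these come from a real feed.
        string escaped = "<description>&lt;b&gt;bold&lt;/b&gt; text</description>";
        string cdata = "<description><![CDATA[<b>bold</b> text]]></description>";
        string raw = "<description><b>bold</b> text</description>";

        Console.WriteLine(ReadSimpleContent(escaped)); // <b>bold</b> text
        Console.WriteLine(ReadSimpleContent(cdata));   // <b>bold</b> text

        try
        {
            ReadSimpleContent(raw);
        }
        catch (XmlException ex)
        {
            // The unescaped <b> is a child element, so the read is rejected.
            Console.WriteLine("Rejected: " + ex.Message);
        }

        // After the repair step the same content reads as plain text.
        Console.WriteLine(ReadSimpleContent(WrapChildrenInCData(raw)));
    }
}
```

With a whole feed, the same idea could be applied to each description element of an XDocument before handing doc.CreateReader() to SyndicationFeed.Load, but I have not tested that against the feeds mentioned in the question.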