LINQ to XML: how to ignore HTML code?
I am using XElement (LINQ to XML) to parse an RSS feed.
RSS example:
<item>
<title>Waterfront Ice Skating</title>
<link>http://www.eventfinder.co.nz/2011/sep/wellington/wellington-waterfront-ice-skating?utm_medium=rss</link>
<description><p>An ice skating rink in Wellington for a limited time only!
Enjoy the magic of the New Zealand winter at an outdoor skating experience with all the fun and atmosphere of New York&#039;s Rockefeller Centre or Central Park, ...</p><p>Wellington | Friday, 30 September 2011 - Sunday, 30 October 2011</p></description>
<content:encoded><![CDATA[Today, Wellington Waterfront<br/>Wellington]]></content:encoded>
<guid isPermalink="false">108703</guid>
<pubDate>2011-09-30T10:00:00Z</pubDate>
<enclosure url="http://s1.eventfinder.co.nz/uploads/events/transformed/190501-108703-13.jpg" length="5000" type="image/jpeg"></enclosure>
</item>
It's all working fine, but the description element has a lot of HTML markup that I need to remove.
Description:
<description><p>An ice skating rink in Wellington for a limited time only!
Enjoy the magic of the New Zealand winter at an outdoor skating experience with all the fun and atmosphere of New York&#039;s Rockefeller Centre or Central Park, ...</p><p>Wellington | Friday, 30 September 2011 - Sunday, 30 October 2011</p></description>
Could anyone assist with this?
If it is an RSS feed, why not use System.ServiceModel.Syndication? SyndicationFeed in combination with an XmlReader will deal with your XML-encoding issues:
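// Requires: using System.Xml; using System.ServiceModel.Syndication;
// A verbatim string (@"...") takes single backslashes.
using (XmlReader reader = XmlReader.Create(@"C:\Users\justMe\myXml.xml"))
{
    SyndicationFeed myFeed = SyndicationFeed.Load(reader);
    ...
}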
Then remove the HTML tags with a regex, as suggested by @nemesv, or use something like this:
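// Extension method: declare it inside a static class.
// Requires System.Text.RegularExpressions and System.Web (HttpUtility).
public static string StripHTML(this string htmlText)
{
    // Drop anything that looks like a tag, then decode the remaining entities.
    var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return HttpUtility.HtmlDecode(reg.Replace(htmlText, string.Empty));
}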
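To show how the two pieces of this answer fit together, here is a minimal sketch that loads a feed with SyndicationFeed and strips the tags from each item's summary. It rests on a few assumptions: the System.ServiceModel.Syndication assembly (or package) is referenced, the inline feed string is a trimmed-down, hypothetical sample rather than the real feed, and System.Net.WebUtility is swapped in for HttpUtility so the example does not need System.Web. The class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Net;                       // WebUtility, swapped in here for HttpUtility
using System.ServiceModel.Syndication;  // assumption: this assembly/package is referenced
using System.Text.RegularExpressions;
using System.Xml;

public static class FeedSummaries
{
    // Same idea as StripHTML above: remove tags, then decode the remaining entities.
    public static string StripHtml(string html)
    {
        return WebUtility.HtmlDecode(Regex.Replace(html, "<[^>]+>", string.Empty));
    }

    public static void Main()
    {
        // Hypothetical, trimmed-down feed used only for illustration.
        const string rss =
            "<rss version=\"2.0\"><channel>" +
            "<title>Events</title><link>http://example.com/</link><description>Sample</description>" +
            "<item><title>Waterfront Ice Skating</title>" +
            "<description>&lt;p&gt;An ice skating rink in Wellington&lt;/p&gt;</description></item>" +
            "</channel></rss>";

        using (XmlReader reader = XmlReader.Create(new StringReader(rss)))
        {
            SyndicationFeed feed = SyndicationFeed.Load(reader);
            foreach (SyndicationItem item in feed.Items)
            {
                // For RSS 2.0 feeds, <description> is exposed as Summary.
                Console.WriteLine(item.Title.Text + ": " + StripHtml(item.Summary.Text));
            }
        }
    }
}
```

Loading from a string keeps the sketch self-contained; with a real feed you would pass a file path or URL to XmlReader.Create instead.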
First, HtmlDecode the content of the description with System.Net.HttpUtility.HtmlDecode. This turns the encoded &lt;p&gt; into <p>. Then you can remove the HTML tags with a regex (see "Using C# regular expressions to remove HTML tags") or with an HTML parsing library.
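The decode-then-strip order described here can be sketched with XElement directly. This is only an illustration under stated assumptions: the item below is a shortened, hypothetical version of the one in the question, ToPlainText and DescriptionCleaner are made-up names, and System.Net.WebUtility.HtmlDecode (available in .NET 4 and later) stands in for HttpUtility.HtmlDecode. A simple regex like this will not cope with every kind of HTML, so a parsing library may be the safer choice for messy input.

```csharp
using System;
using System.Net;                      // WebUtility (assumption: .NET 4+ or .NET Core)
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class DescriptionCleaner
{
    // Hypothetical helper: decode entities first, then drop anything that looks like a tag.
    public static string ToPlainText(string description)
    {
        string decoded = WebUtility.HtmlDecode(description);
        // Replace tags with a space so adjacent paragraphs do not run together.
        string withoutTags = Regex.Replace(decoded, "<[^>]+>", " ");
        // Collapse the runs of whitespace left behind.
        return Regex.Replace(withoutTags, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Shortened sample item; the description holds XML-escaped HTML.
        XElement item = XElement.Parse(
            "<item><title>Waterfront Ice Skating</title>" +
            "<description>&lt;p&gt;An ice skating rink in Wellington, like New York&amp;#039;s Central Park.&lt;/p&gt;" +
            "&lt;p&gt;Wellington | Friday, 30 September 2011&lt;/p&gt;</description></item>");

        // Reading the element value undoes the XML escaping, so this string contains literal <p> tags.
        string raw = (string)item.Element("description");
        Console.WriteLine(ToPlainText(raw));
    }
}
```

Replacing each tag with a space rather than an empty string is a design choice: it keeps the text of neighbouring paragraphs from being glued together.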