Strip HTML and CSS in C#
I'm creating mails in one of my solutions and need to provide both HTML and plaintext mails from a given HTML page.
However, I haven't found a really good way to strip HTML, JS, and CSS from whatever HTML template the customers might provide.
Is there a simple solution to this, perhaps a component that handles all of it, or do I need to start puzzling with regexps? And is it even possible to create a bulletproof regexp for all possible tags?
Regards
Give HtmlAgilityPack a go. It has methods for extracting the text out of an HTML Document.
You basically just need to do the following:
using HtmlAgilityPack;

// Parse the markup, then read the text content of the whole document.
var doc = new HtmlDocument();
doc.LoadHtml(htmlStr);
var node = doc.DocumentNode;
var textContent = node.InnerText;
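One caveat with the snippet above: as far as I know, InnerText also returns the text inside script and style elements, so JS and CSS can leak into the plaintext mail. A minimal sketch of one way around that follows. It assumes the HtmlAgilityPack package is referenced; `HtmlToText` and `Convert` are hypothetical names of my own, not part of the library.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlToText
{
    // Hypothetical helper (not part of HtmlAgilityPack): returns the visible
    // text of an HTML string with script and style content removed.
    public static string Convert(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so check first.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style");
        if (unwanted != null)
        {
            // Copy to a list so removing nodes cannot disturb the iteration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // Decode entities such as &amp; into plain characters.
        var text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);

        // Collapse every run of whitespace into a single space.
        var words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words);
    }
}
```

Usage would be `var plain = HtmlToText.Convert(htmlStr);`. Note that collapsing whitespace also flattens paragraph breaks; if the plaintext mail needs them, you would have to walk the nodes and emit newlines for block elements yourself.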
As for a component that can strip HTML: Html Agility Pack.
Take a look at the question "HTMLAgilityPack parse in the InnerHTML". One of its answers shows how to do this using Html Agility Pack.
You might find the Html Agility Pack helpful in your situation.
On this page you can find a really fast algorithm for stripping HTML tags from a string input. It has some issues with invalid HTML, but it's still a great resource: http://www.dotnetperls.com/remove-html-tags
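To give an idea of what such a fast approach can look like, here is a minimal sketch of a single-pass character scanner. It is written in the same spirit but is not copied from that page, and `TagStripper` and `StripTags` are hypothetical names. It assumes every '<' opens a tag and every '>' closes one, which is exactly why it struggles with invalid HTML.

```csharp
using System;
using System.Text;

public static class TagStripper
{
    // Hypothetical helper: drops everything between '<' and '>' in one pass.
    public static string StripTags(string source)
    {
        if (string.IsNullOrEmpty(source))
        {
            return string.Empty;
        }

        var result = new StringBuilder(source.Length);
        bool insideTag = false;

        foreach (char c in source)
        {
            if (c == '<')
            {
                insideTag = true;   // start skipping
            }
            else if (c == '>')
            {
                insideTag = false;  // stop skipping; the '>' itself is dropped
            }
            else if (!insideTag)
            {
                result.Append(c);   // ordinary text outside any tag
            }
        }

        return result.ToString();
    }
}
```

For example, `TagStripper.StripTags("<b>Hello</b> world")` returns `Hello world`. The scanner only removes the tags themselves: the contents of script and style elements stay in the output, entities such as `&amp;` are not decoded, and a stray '<' or '>' in the text is treated as a tag boundary. For customer-supplied templates a real parser is the safer choice.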