Strip ALL HTML from a String?

2023-02-13 13:49

I've seen regexes that can remove tags, which is great, but I also have entities like &nbsp; and so on.

This isn't actually from an HTML file; it's from a string. I'm pulling down data from SharePoint web services, which gives me whatever HTML users enter or the editor generates, like

<div>Hello! Please remember to clean the break room!!! &quot;bob&quote; <BR> </div>

So, I'm parsing through 100-900 rows with 8-20 columns each.

Take a look at the HTML Agility Pack. It's an HTML parser that you can use to extract the InnerText from HTML nodes in a document.

As has been pointed out many times here on SO, you can't trust HTML parsing to a regular expression. There are times when it might be considered appropriate (for extremely limited tasks); but in general, HTML is too complex and too prone to irregularity. Bad things can happen when you try to parse HTML with Regular Expressions.

Using a parser such as HAP gives you much more flexibility. A (rough) example of what it might look like to use it for this task:

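// Requires: using System.Text; and a reference to the HtmlAgilityPack package.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
// Load reads from a file path; for HTML already held in a string, use doc.LoadHtml(html).
doc.Load("path to your HTML document");

StringBuilder content = new StringBuilder();
foreach (var node in doc.DocumentNode.DescendantNodesAndSelf())
{
    // Leaf nodes only, so text is not repeated by its ancestor elements.
    if (!node.HasChildNodes)
    {
        content.AppendLine(node.InnerText);
    }
}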
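
Since the question is about HTML held in a string (and about entities like &nbsp;), here is a rough sketch of the same idea for that case. It assumes the HtmlAgilityPack NuGet package is referenced; the HtmlStripper class and StripHtml method are names made up for illustration. As far as I know, InnerText returns entities undecoded, so HtmlEntity.DeEntitize is applied afterwards:

```csharp
using System;
using HtmlAgilityPack;

public static class HtmlStripper
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string StripHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse from a string instead of a file path

        // InnerText on the root collects the text beneath it, tags removed.
        string text = doc.DocumentNode.InnerText;

        // Decode entities such as &quot; and &nbsp; into plain characters.
        text = HtmlEntity.DeEntitize(text);

        // &nbsp; decodes to U+00A0; swap it for an ordinary space.
        return text.Replace('\u00A0', ' ').Trim();
    }
}
```

With 100-900 rows of 8-20 columns, you would call HtmlStripper.StripHtml(value) once per cell; a fresh HtmlDocument per call keeps the helper stateless, which seems fine at that volume.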

You can also run XPath queries against your document, in case you're only interested in a specific node or set of nodes:

var nodes = doc.DocumentNode.SelectNodes("your XPATH query here");
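
One thing to watch with SelectNodes is that it returns null rather than an empty collection when nothing matches. A minimal sketch of handling that, again assuming HtmlAgilityPack is referenced; XPathTextExample and SelectText are hypothetical names, not part of the library:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathTextExample
{
    // Hypothetical helper: text of every node matched by an XPath query.
    public static List<string> SelectText(string html, string xpath)
    {
        var results = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities and trim surrounding whitespace for each match.
            results.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return results;
    }
}
```

For example, XPathTextExample.SelectText(html, "//div") would give the text of each div element in the string.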

Hope this helps.
