Remove HTML tags and entities from a string

I would like a regex to remove HTML tags and entities such as &nbsp; and &quot; from a string. The regex I have removes the HTML tags but not the entities. I'm using .NET 4.

Thanks

CODE:

     String result = Regex.Replace(blogText, @"<[^>]*>", String.Empty);


Don't use regular expressions for this; use the HTML Agility Pack:

http://www.codeplex.com/htmlagilitypack
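
A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (`HtmlText`, `ToPlainText`) are made up for illustration; `HtmlDocument.LoadHtml`, `DocumentNode.InnerText` and `HtmlEntity.DeEntitize` come from the library.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack package is installed

public static class HtmlText
{
    // Hypothetical helper: returns the text content of an HTML fragment.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html ?? string.Empty);

        // InnerText drops the markup; DeEntitize converts entities such as
        // &quot; and &nbsp; into the characters they stand for.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
    }
}
```

Note that this decodes entities rather than deleting them, so `&nbsp;` becomes a non-breaking space character. Depending on the library version, the contents of script and style elements may still appear in `InnerText`, so you may want to remove those nodes first.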


If you want to build on what you already have, you can change it to the following:

String result = Regex.Replace(blogText, @"<[^>]*>|&\w+;?", String.Empty);

It means...

  1. Either match tags as you defined...
  2. ...or match a & followed by at least one word character \w (as many as possible), plus the trailing ; if there is one.

Neither alternative handles every nasty case (a literal > inside an attribute value, numeric entities such as &#160;, or plain text like AT&T), but it works for typical input.
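
To make the trade-offs concrete, here is a small sketch of the regex approach using only the .NET base class library. The names `HtmlCleaner`, `RemoveTagsAndEntities` and `StripTagsAndDecode` are hypothetical. The optional `;?` at the end of the pattern is an assumption added here so that the semicolon of `&nbsp;` is removed along with the rest of the entity. The second method is a variation, not what the answer proposes: it decodes entities with `WebUtility.HtmlDecode` instead of deleting them.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlCleaner
{
    // Tags, or an ampersand followed by word characters and an optional semicolon.
    private const string TagOrEntity = @"<[^>]*>|&\w+;?";

    // Deletes tags and named entities outright.
    public static string RemoveTagsAndEntities(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        return Regex.Replace(html, TagOrEntity, string.Empty);
    }

    // Variation: deletes tags, then turns entities into the characters they encode.
    public static string StripTagsAndDecode(string html)
    {
        if (string.IsNullOrEmpty(html)) return string.Empty;
        string withoutTags = Regex.Replace(html, @"<[^>]*>", string.Empty);
        return WebUtility.HtmlDecode(withoutTags);
    }
}
```

For example, `RemoveTagsAndEntities("<p>Hello&nbsp;<b>world</b></p>")` gives `Helloworld`, because the entity is deleted rather than replaced by a space, and `RemoveTagsAndEntities("AT&T")` gives `AT`. `StripTagsAndDecode("<b>Tom &amp; Jerry</b>")` gives `Tom & Jerry`.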
