Finding HTML strings in document

2022-12-17 19:43

I want to get all HTML <p>...</p> elements in a document.

I am using a Regex to find all such strings:

Regex regex = new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

But I am not able to get any result. Is there anything wrong with my regular expression?

For now, I just want to get everything that comes in between <p>...</p> tags, and I want to use Regex for this because the source is not an HTML document.

DO NOT PARSE HTML USING Regular Expressions!!!

Instead, use the HTML Agility Pack.

For example:

var doc = new HtmlDocument();
doc.Load(...);

var pTags = doc.DocumentNode.Descendants("p");

EDIT: You can do this even if the document isn't actually HTML.

Using a regex for this is not the best idea. I suggest reading this thread:

RegEx match open tags except XHTML self-contained tags

The approach of using a regex to match HTML elements is destined to fail. A regular expression is not capable of reliably matching an HTML element. It's possible to build a more complex HTML element than your regex can match.

For example, I could beat your regex with the following:

<p>hello<p>again</p></p>

Instead of using a regex, you need to use an HTML (or potentially an XML) parser / DOM. This is the only way to reliably query an HTML file.

Detailed Explanation of why:

http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx
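As a rough illustration of the parser-based route, here is a minimal sketch that uses System.Xml.Linq from the .NET base class library. It assumes the input is well-formed XML, which real-world HTML often is not; for genuine HTML a dedicated HTML parser is the safer choice. The class name, method name, and sample input are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XmlParagraphDemo
{
    // Returns the text of every <p> element, including nested ones.
    // Assumption: the input is well-formed XML with a single root element.
    public static List<string> ParagraphTexts(string wellFormedXml)
    {
        XDocument doc = XDocument.Parse(wellFormedXml);
        // Descendants walks the tree in document order, so nesting is not a problem.
        return doc.Descendants("p").Select(p => p.Value).ToList();
    }

    public static void Main()
    {
        // Hypothetical sample that wraps the nested example in a root element.
        string sample = "<root><p>hello<p>again</p></p><p>second</p></root>";
        foreach (string text in ParagraphTexts(sample))
            Console.WriteLine(text);
    }
}
```

Because the parser builds a tree, the nested paragraph shows up as its own element instead of confusing the match.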

While others have said that you shouldn't be doing this with regular expressions, the reason yours is failing is that there is more HTML between your <p> tags, and your exclusion of > is causing the Regex to not match.
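A small sketch of that failure, using the pattern from the question unchanged. The class name and the two sample strings are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OriginalPatternDemo
{
    // The pattern from the question, copied as-is.
    private static readonly Regex Original =
        new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

    // True when the question's pattern finds a paragraph in the input.
    public static bool Matches(string input) => Original.IsMatch(input);

    public static void Main()
    {
        // Plain text only: [^\>]* can run up to the closing tag.
        Console.WriteLine(Matches("<p>plain text</p>"));
        // Inner markup: [^\>]* cannot cross the '>' of <b>, so </p> is never reached.
        Console.WriteLine(Matches("<p>some <b>bold</b> text</p>"));
    }
}
```

The first call finds a match and the second does not, which fits the symptom described in the question if the paragraphs in the source contain other tags.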

@"(?is)<p>(?>(?:(?!</?p>).)*)</p>"

(?:(?!</?p>).)* matches one character at a time, after doing a lookahead to make sure it isn't part of a <p> or </p> tag.

(?>...) is an atomic group; it prevents backtracking that we know would be pointless.

(?is) is an alternative mechanism for specifying match modifiers--in this case, IgnoreCase and Singleline (the latter in case there are linefeeds or carriage returns between the tags, which would be redundant, but you did say it's not really HTML).

By the way, < and > have no special meaning in regexes, so there's no need to escape them. In fact, in some flavors you can give them special meanings by escaping them: \< and \> mean "beginning of word" and "end of word" respectively. But in .NET regexes the backslashes are just clutter.
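To try the pattern out, here is a minimal sketch. The named group "content" is an addition made for this example so the inner text can be read back; the pattern itself captures nothing. The class name, method name, and sample input are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParagraphRegexDemo
{
    // The pattern from this answer, wrapped in a named group "content"
    // (the group is an addition for this sketch, not part of the answer).
    private static readonly Regex Paragraph =
        new Regex(@"(?is)<p>(?<content>(?>(?:(?!</?p>).)*))</p>");

    // Returns the text between each innermost <p>...</p> pair.
    public static List<string> Extract(string input)
    {
        var results = new List<string>();
        foreach (Match m in Paragraph.Matches(input))
            results.Add(m.Groups["content"].Value);
        return results;
    }

    public static void Main()
    {
        // Mixed case, an inner <b> tag, a line break, and a nested paragraph.
        string sample = "<P>some <b>bold</b>\ntext</P> and <p>hello<p>again</p></p>";
        foreach (string content in Extract(sample))
            Console.WriteLine("[" + content + "]");
    }
}
```

Note that for the nested input only the innermost paragraph is returned, because the lookahead refuses to cross another <p> tag. Whether that is the desired behaviour depends on the source text.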

You asked for it, but really, don't do this with regexes unless you control 100% of the HTML production...

public static Regex regex = new Regex(
      "(?<open>\\<p(?<attr>[^>])*\\>)(?<content>.*)\\</p(?:\\s*)\\>",
    RegexOptions.Multiline
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
    );

tested against

<p>hello world</p>
<p style="Foo"></p >
<p>who nests paragraphs <p>in 2010?</p> </p  >
<p /><p><a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ</a></p><p/>

will yield for the content group

"hello world"
""
"who nests paragraphs <p>in 2010?</p>"
"<p><a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ</a>"

so if you are sure there are no  go for it
