Finding HTML strings in document

I want to get all HTML <p>...</p> elements in a document.

I am using a Regex to find all such strings:

Regex regex = new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

But I am not able to get any result. Is there anything wrong with my regular expression?

For now, I just want everything that comes between the <p>...</p> tags, and I want to use a Regex for this because the source is not an HTML document.


DO NOT PARSE HTML USING Regular Expressions!!!


Instead, use the HTML Agility Pack.

For example:

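// Requires the HtmlAgilityPack NuGet package.
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load(...);   // pass a file path or stream here

// Enumerate every <p> element anywhere in the document.
var pTags = doc.DocumentNode.Descendants("p");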

EDIT: You can do this even if the document isn't actually HTML.


Using a regex for this is not the best idea. I suggest reading this thread:

RegEx match open tags except XHTML self-contained tags


The approach of using a regex to match HTML elements is destined to fail. A regular expression cannot reliably match an HTML element, because it is always possible to build a more complex element than your regex can handle.

For example, I could beat your regex with the following:

<p>hello<p>again</p></p>

Instead of using a regex, you need to use an HTML (or potentially an XML) parser / DOM. This is the only way to reliably query an HTML file.

Detailed Explanation of why:

  • http://blogs.msdn.com/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx


Others have said that you shouldn't be doing this with regular expressions. That aside, the reason yours is failing is that there is more HTML between your <p> tags, and your exclusion of > (the [^\>]* part) prevents the Regex from matching.
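To make that failure mode concrete, here is a minimal sketch that runs the pattern from the question against two inputs. The sample strings are assumptions made up for illustration, not taken from the asker's document.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the question: [^\>]* refuses to cross any '>' character.
var original = new Regex(@"\<p\>([^\>]*)\</p\>", RegexOptions.IgnoreCase);

// Hypothetical inputs for illustration.
// Plain text between the tags contains no '>', so the pattern can match it.
string plain = "<p>hello world</p>";
// The inner <b> tag contains '>', so the character class stops there
// and the closing </p> is never reached.
string withMarkup = "<p>hello <b>world</b></p>";

Console.WriteLine(original.IsMatch(plain));                // True
Console.WriteLine(original.Match(plain).Groups[1].Value);  // hello world
Console.WriteLine(original.IsMatch(withMarkup));           // False
```

So the pattern only works as long as nothing between the tags contains a > character.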


@"(?is)<p>(?>(?:(?!</?p>).)*)</p>"

(?:(?!</?p>).)* matches one character at a time, after doing a lookahead to make sure it isn't part of a <p> or </p> tag.

(?>...) is an atomic group; it prevents backtracking that we know would be pointless.

(?is) is an alternative mechanism for specifying match modifiers--in this case, IgnoreCase and Singleline (the latter in case there are linefeeds or carriage returns between the tags, which would be redundant, but you did say it's not really HTML).

By the way, < and > have no special meaning in regexes, so there's no need to escape them. In fact, in some flavors you can give them special meanings by escaping them: \< and \> mean "beginning of word" and "end of word" respectively. But in .NET regexes the backslashes are just clutter.
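As a rough sketch of how that pattern behaves, the snippet below applies it to a made-up input (an assumption for illustration) that contains uppercase tags, inner markup, a line break, and a nested paragraph.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from this answer; (?is) turns on IgnoreCase and Singleline inline.
var paragraph = new Regex(@"(?is)<p>(?>(?:(?!</?p>).)*)</p>");

// Hypothetical input: mixed case, inner <b> markup, a newline, and nested <p> tags.
string input = "<P>first <b>bold</b>\nline</P> text <p>hello<p>again</p></p>";

MatchCollection matches = paragraph.Matches(input);

// Two matches are found:
//   1. the whole first paragraph, including the <b> element and the newline
//   2. only the innermost "<p>again</p>" of the nested pair
foreach (Match m in matches)
{
    Console.WriteLine(m.Value);
}
```

Note that for the nested paragraphs only the innermost element is matched, because the lookahead stops at any <p> or </p>. That is a limitation of the approach, not something a different flag would fix.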


You asked for it, but really, don't do this with regexes unless you control 100% of the HTML production...

public static Regex regex = new Regex(
      "(?<open>\\<p(?<attr>[^>])*\\>)(?<content>.*)\\</p(?:\\s*)\\>",
    RegexOptions.Multiline
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
    );

tested against

<p>hello world</p>
<p style="Foo"></p >
<p>who nests paragraphs <p>in 2010?</p> </p  >
<p /><p><a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ</a></p><p/>

will yield for the content group

"hello world"
""
"who nests paragraphs <p>in 2010?</p>"
"<p><a href="http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454">TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ</a>"

So if you are sure there are no <p/> elements, go for it.
