Recursive regex for div tags (not trying to parse HTML with regex)

2023-03-06 06:17

I have a bunch of wiki markup. Sometimes people just throw random HTML into the middle of it, and somehow Wikipedia just rolls with it, as it does with all kinds of other badly formed wiki markup. I want to match everything inside the divs.

I need to recursively find all the <div>blah</div> tags, including div tags with other div tags nested inside them. I am trying to match the div tags and everything inside them. I have this, which I believe almost works:

new Regex(@"\<div.*?\> (?<DEPTH>)                   # opening 
            (?>                # now match...
               [^(\<div.*?\>)(\<\/div\>)]+          # any characters except divs
            |                  # or
               \<div.*?\>  (?<DEPTH>)  # a opening div, increasing the depth counter
            |                  # or
               \<\/div\>  (?<-DEPTH>) # a closing div, decreasing the depth counter
            )*                 # any number of times
            (?(DEPTH)(?!))     # until the depth counter is zero again
          \<\/div\>                   # then match the closing fix",
            RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

Maybe I should be using another methodology to parse this, but at this point this is the final regex statement that I need.

Here is an example:

<div class="infobox sisterproject" style="font-size: 90%; padding: .5em 1em 1em 1em;">
<div style="text-align:center;">
Find more about '''{{{display|{{{1|{{PAGENAME}}}}}}}}''' on Wikipedia's [[Wikipedia:Wikimedia sister projects|sister projects]]:
</div><!--
-->{{#ifeq:{{{wikt}}}|no||<!--
-->[[File:Wiktionary-logo-en.svg|25px|link=wikt:Special:Search/{{{wikt|{{{1|{{PAGENAME}}}}}}}}|Search Wiktionary]] [[wikt:Special:Search/{{{wikt|{{{1|{{PAGENAME}}}}}}}}|Definitions]] from Wiktionary<br />}}<!--
-->{{#ifeq:{{{b}}}|no||<!--
-->[[File:Wikibooks-logo.svg|25px|link=b:Special:Search/{{{b|{{{1|{{PAGENAME}}}}}}}}|Search Wikibooks]] [[b:Special:Search/{{{b|{{{1|{{PAGENAME}}}}}}}}|Textbooks]] from Wikibooks<br />}}<!--
-->{{#ifeq:{{{q}}}|no||<!--
-->[[File:Wikiquote-logo.svg|25px|link=q:Special:Search/{{{q|{{{1|{{PAGENAME}}}}}}}}|Search Wikiquote]] [[q:Special:Search/{{{q|{{{1|{{PAGENAME}}}}}}}}|Quotations]] from Wikiquote<br />}}<!--
-->{{#ifeq:{{{s}}}|no||{{#ifeq:{{{author|no}}}|yes|<!--
-->[[File:Wikisource-logo.svg|25px|link=s:Special:Search/Author:{{{s|{{{1|{{PAGENAME}}}}}}}}|Search Wikisource]] [[s:Special:Search/Author:{{{s|{{{1|{{PAGENAME}}}}}}}}|Source texts]] from Wikisource<br />|<!--
-->[[File:Wikisource-logo.svg|25px|link=s:Special:Search/{{{s|{{{1|{{PAGENAME}}}}}}}}|Search Wikisource]] [[s:Special:Search/{{{s|{{{1|{{PAGENAME}}}}}}}}|Source texts]] from Wikisource<br />}}}}<!--
-->{{#ifeq:{{{commons}}}|no||<!--
-->[[File:Commons-logo.svg|25px|link=commons:Special:Search/{{{commons|{{{1|{{PAGENAME}}}}}}}}|Search Commons]] [[commons:Special:Search/{{{commons|{{{1|{{PAGENAME}}}}}}}}|Images and media]] from Commons<br />}}<!--
-->{{#ifeq:{{{n}}}|no||<!--
-->[[File:Wikinews-logo.svg|25px|link=n:Special:Search/{{{n|{{{1|{{PAGENAME}}}}}}}}|Search Wikinews]] [[n:Special:Search/{{{n|{{{1|{{PAGENAME}}}}}}}}|News stories]] from Wikinews<br />}}<!--
-->{{#ifeq:{{{v}}}|no||<!--
-->[[File:Wikiversity-logo-Snorky.svg|25px|link=v:Special:Search/{{{v|{{{1|{{PAGENAME}}}}}}}}|Search Wikiversity]] [[v:Special:Search/{{{v|{{{1|{{PAGENAME}}}}}}}}|Learning resources]] from Wikiversity<br />}}<!--
-->{{#ifeq:{{{species<includeonly>|no</includeonly>}}}|no||<!--
-->[[File:Wikispecies-logo.svg|25px|link=species:Special:Search/{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}|Search Wikispecies]] [[species:Special:Search/{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}|{{{species<noinclude>|{{{1|{{PAGENAME}}}}}</noinclude>}}}]] from Wikispecies}}
</div><noinclude>

Thanks

I think it is not a good idea to parse the HTML with regex; you could use the Html Agility Pack instead.

 new Regex(@"<div\b[^>]*>(?><div\b[^>]*>(?<DEPTH>)|</div>(?<-DEPTH>)|.?)*(?(DEPTH)(?!))</div>", RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled | RegexOptions.IgnoreCase | RegexOptions.Singleline);

In the time it took me to fix my expression, I would not even have been halfway done getting Html Agility Pack up and working.
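
For anyone landing here later, this is a small sketch of how a balancing-group pattern like this can be exercised. One thing worth noting about the attempt in the question: [^(\<div.*?\>)(\<\/div\>)]+ is a character class, so it excludes the individual characters listed (such as <, >, d, i, v, /) rather than the div tags as units. The pattern below is my own variation, not the poster's exact expression: it uses a negative lookahead so the "any character" branch cannot step over a div tag. The class name DivMatcher, the method FindOuterDivs and the sample string are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DivMatcher
{
    // Assumption: a variation on the expression above, not the poster's exact regex.
    // The lookahead stops the single-character branch from consuming a div tag,
    // so only the first two branches ever change the DEPTH stack.
    private static readonly Regex OuterDiv = new Regex(@"
        <div\b[^>]*>                      # outermost opening tag
        (?>
            <div\b[^>]*>  (?<DEPTH>)      # nested opening tag: push onto DEPTH
          |
            </div>        (?<-DEPTH>)     # nested closing tag: pop, fails when DEPTH is empty
          |
            (?! </?div\b ) .              # any single character that does not start a div tag
        )*
        (?(DEPTH)(?!))                    # fail if a nested div is still open
        </div>                            # outermost closing tag",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns each outermost <div>...</div> block found in the input text.
    public static string[] FindOuterDivs(string wikiText)
    {
        MatchCollection matches = OuterDiv.Matches(wikiText);
        var result = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            result[i] = matches[i].Value;
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample input: one nested div and one flat div.
        string sample = "intro <div class=\"a\">x <div>y</div> z</div> middle <div>w</div> end";
        foreach (string block in FindOuterDivs(sample))
        {
            Console.WriteLine(block);
        }
    }
}
```

With that sample I would expect two lines of output: the nested block as a single match, then the flat one. An unclosed div or HTML comments containing div tags would still trip it up, so treat it as a starting point rather than a parser.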
