Using Regex, how do I find text that is split by another group of characters?

I'm looking for the html end tag in an mhtml file. The html is in fixed-width rows with a line break at the end like this:

size:12pt">Insert an image into the document here.</span></p><p style=3D"ma=
rgin:0pt 0pt 3pt; text-align:center"><img src=3D"image.001.png" width=3D"20=
0" height=3D"200" alt=3D"" /></p><p style=3D"margin:0pt 0pt 3pt"><span styl=
e=3D"font-family:Arial; font-size:12pt">&#xa0;</span></p></div></body></htm=
l>

Notice the </html> end tag is split in the middle by "=\n".

How can I find the </html> end tag regardless of where it is split?

I can find a single permutation using a Regex similar to one of the following, but I'd like to do it in one shot.

<((=\n)?/html>)
</((=\n)?html>)
</h((=\n)?tml>)
</ht((=\n)?ml>)
etc...

I've read RegEx match open tags except XHTML self-contained tags and read the post at http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html among others, but I still think the question is valid.

I'm not making an html parsing engine. I'm just looking for one very specific pattern. And... this has to go out tomorrow. All great reasons to go with a down and dirty solution >:D


<(=\n)?/(=\n)?h(=\n)?t(=\n)?m(=\n)?l(=\n)?>
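The pattern above can also be generated instead of typed by hand. This is a minimal sketch in Python (the thread itself appears to target .NET, so the language is an assumption made for illustration). The helper name split_tolerant_pattern is hypothetical, and the optional \r is added on the assumption that the file may use Windows line endings.

```python
import re

# Optional soft line break: "=" followed by a newline (with an optional "\r").
SOFT_BREAK = r"(?:=\r?\n)?"


def split_tolerant_pattern(literal: str) -> re.Pattern:
    """Build a regex that matches `literal` with an optional soft break
    between every pair of adjacent characters (hypothetical helper)."""
    return re.compile(SOFT_BREAK.join(re.escape(ch) for ch in literal))


# Last two rows of the sample from the question.
sample = 'e=3D"font-family:Arial; font-size:12pt">&#xa0;</span></p></div></body></htm=\nl>\n'

end_tag = split_tolerant_pattern("</html>")
match = end_tag.search(sample)
if match:
    # match.start() and match.end() are offsets into the original, still folded text.
    print(repr(match.group(0)), match.start(), match.end())
```

Because the match is made against the untouched text, the offsets can be used directly to cut or insert content in the original file.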


Just use Regex.Replace() to find every =\r\n and replace it with String.Empty. Then you can run your matches without the line breaks getting in the way.
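A rough sketch of that idea, written in Python rather than .NET purely for illustration (re.sub plays the role of Regex.Replace here). The function name remove_soft_breaks is hypothetical, and the \r is treated as optional on the assumption that the file might use either line ending.

```python
import re


def remove_soft_breaks(text: str) -> str:
    """Delete every "=" + line break pair so wrapped rows are rejoined
    (hypothetical helper)."""
    return re.sub(r"=\r?\n", "", text)


# Tail of the sample from the question, with Windows line endings assumed.
sample = '</span></p></div></body></htm=\r\nl>\r\n'

unfolded = remove_soft_breaks(sample)

# A plain substring search is enough once the rows are rejoined.
index = unfolded.find("</html>")
print(index, repr(unfolded))
```

One trade-off: index is an offset into the rejoined text, not into the original file, so this suits cases where the cleaned copy is what gets processed afterwards. Sequences such as =3D are left as they are.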


HTML is not a regular language ... it doesn't lend itself to processing using regular expressions.

Tasks like brace or tag counting/matching can't be done correctly for arbitrary input using regular expressions.

You should really use an actual HTML parser to do so, not regex.
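For comparison, here is a minimal sketch of the parser route in Python. It assumes the rows shown in the question are quoted-printable encoded (the =3D sequences and trailing = suggest so), decodes them with the standard quopri module, and feeds the result to the standard html.parser. The class name EndTagCollector is hypothetical, and a real MHTML file would also need its MIME parts separated first, which is not shown.

```python
import quopri
from html.parser import HTMLParser


class EndTagCollector(HTMLParser):
    """Record the name of every end tag the parser sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.end_tags = []

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


# Fragment of the sample from the question, as raw bytes.
raw = b'<span style=3D"font-family:Arial; font-size:12pt">&#xa0;</span></p></div></body></htm=\nl>'

# Decoding removes the soft line breaks and turns =3D back into "=".
decoded = quopri.decodestring(raw).decode("ascii")

collector = EndTagCollector()
collector.feed(decoded)
collector.close()
print(collector.end_tags)
```

After decoding, the closing html tag is an ordinary end tag, so no split-aware pattern is needed.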
