How to extract string between 2 markers using Regex in .NET?

2023-04-06 07:46 问答作者：

I have a source to a web page and I need to extract the body. So anything between </head><body> and </body></html>.

I've tried the following with no success:

var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");

It finds a 开发者_运维问答string but cuts it off long before </body></html>. I escaped characters based on the RegEx cheat sheet.

What am i missing?

I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.

The latest version even supports Linq so you can get your content like this:

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;

Regex is not meant for such html handling, as many here would say. Without having your sample web page / html, I can only say that try removing the non-greedy ? quantifier in (.*?) and try. After all, a html page will have only one head and body.

Though regexes are definitely not the best tool for this task, there are a few suggestions and points I would like to make:

un-escape the angle brackets - with the @ before your string, they are going through to the regex and they do not need to be escaped for a .NET regex
with your regex, you need to make sure that the head/body tag combinations do not have any white-space between them.
with your regex, the body tag cannot have any attributes.

I would suggest something more like:

(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)

this seems to work for me on the source of this page!

As the others have said, the correct way to handle this is with an HTML-specific tool. I just want to point out some problems with that cheat-sheet.

First, it's wrong about angle brackets: you do not need to escape them. In fact, it's wrong twice: it also says \< and \> match word boundaries, which is both incorrect for .NET, and incompatible with the advice about escaping angle brackets.

That cheat-sheet is just a random collection of regex syntax elements; most of them will work in most flavors, but many are guaranteed not to work in your particular flavor, whatever it happens to be. I recommend you disregard it and rely instead on .NET-specific documents or Regular-Expressions.info. The books Mastering Regular Expressions and Regular Expressions Cookbook are both excellent, too.

As for your regex, I don't see how it could behave the way you say it does. If it were going to fail, I would expect it to fail completely. Does your HTML document contain a CDATA section or SGML comment with </body></html> inside it? Or is it really two or more HTML documents run together?

继续阅读：.net regex string-parsing

How to extract string between 2 markers using Regex in .NET?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？