How to extract string between 2 markers using Regex in .NET?
I have a source to a web page and I need to extract the body. So anything between </head><body>
and </body></html>
.
I've tried the following with no success:
var match = Regex.Match(output, @"(?<=\</head\>\<body\>)(.*?)(?=\</body\>\</html\>)");
It finds a 开发者_运维问答string but cuts it off long before </body></html>
. I escaped characters based on the RegEx cheat sheet.
What am i missing?
I'd recommend using the HtmlAgilityPack instead - parsing HTML with regular expressions is very, very fragile.
The latest version even supports Linq so you can get your content like this:
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://stackoverflow.com");
string html = doc.DocumentNode.Descendants("body").Single().InnerHtml;
Regex is not meant for such html handling, as many here would say. Without having your sample web page / html, I can only say that try removing the non-greedy ?
quantifier in (.*?)
and try. After all, a html page will have only one head and body.
Though regexes are definitely not the best tool for this task, there are a few suggestions and points I would like to make:
- un-escape the angle brackets - with the @ before your string, they are going through to the regex and they do not need to be escaped for a .NET regex
- with your regex, you need to make sure that the head/body tag combinations do not have any white-space between them.
- with your regex, the body tag cannot have any attributes.
I would suggest something more like:
(?<=</head>\s*<body(\s[^>]*)?>)(.*?)(?=</body>\s*</html>)
this seems to work for me on the source of this page!
As the others have said, the correct way to handle this is with an HTML-specific tool. I just want to point out some problems with that cheat-sheet.
First, it's wrong about angle brackets: you do not need to escape them. In fact, it's wrong twice: it also says \<
and \>
match word boundaries, which is both incorrect for .NET, and incompatible with the advice about escaping angle brackets.
That cheat-sheet is just a random collection of regex syntax elements; most of them will work in most flavors, but many are guaranteed not to work in your particular flavor, whatever it happens to be. I recommend you disregard it and rely instead on .NET-specific documents or Regular-Expressions.info. The books Mastering Regular Expressions and Regular Expressions Cookbook are both excellent, too.
As for your regex, I don't see how it could behave the way you say it does. If it were going to fail, I would expect it to fail completely. Does your HTML document contain a CDATA section or SGML comment with </body></html>
inside it? Or is it really two or more HTML documents run together?
精彩评论