Select a part of the matching string using regular expressions
The data I need to extract from a web page is delimited by specific comments: <!--data-->.
I use this expression: <!--data-->.+?<!--data-->
and it works fine.
But is there a way to get the text without the HTML comments at the beginning and end of the string?
I also need this when looking for img tags in HTML code, where the result should contain only the link to the picture.
Is it possible to express this in a regular expression?
If you wrap the part of the regex you wish to capture in parentheses ( ), you can retrieve the captured string with $1, $2, etc. (the exact syntax depends on your language).
In general though, parsing HTML with regular expressions is a very bad idea. See this answer: RegEx match open tags except XHTML self-contained tags
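For the img case, here is a minimal sketch of the capturing-group idea. It assumes Python, since the question does not name a language, and the sample markup and the names html, IMG_SRC and links are made up for illustration. It only handles quoted src attributes and is not a robust HTML matcher.

```python
import re

# Hypothetical sample markup, only for illustration.
html = '<p><img src="a.png" alt="A"><img class="x" src=\'b/c.jpg\'></p>'

# Group 1 captures the opening quote character, group 2 the URL;
# \1 requires the same quote character to close the attribute value.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*(["\'])(.*?)\1', re.IGNORECASE)

# Keep only group 2, i.e. the link without the surrounding tag.
links = [m.group(2) for m in IMG_SRC.finditer(html)]
print(links)
```

In Python the groups are read with m.group(n) rather than $1 or $2.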
If you want to exclude the comments from the result, put parentheses around the part you want and read the capturing group, or use lookaround assertions.
Solution 1:
<!--data-->(.+?)<!--data-->
Your result is in group 1. How you get the content of this capturing group depends on your language. You should really add this information to your question.
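As an illustration only, this is how reading group 1 might look in Python. The language is an assumption, and the sample string and the names page, match and data are invented for the sketch.

```python
import re

# Hypothetical page fragment, only for illustration.
page = "<p>intro</p><!--data-->first item<!--data--><p>outro</p>"

# re.DOTALL lets "." also match newlines, in case the data spans several lines.
match = re.search(r"<!--data-->(.+?)<!--data-->", page, re.DOTALL)

# group(0) would be the whole match including the comments;
# group(1) is only the part inside the parentheses.
data = match.group(1) if match else None
print(data)
```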
Solution 2:
(?<=<!--data-->).+?(?=<!--data-->)
This matches only the text defined by .+?, without the surrounding comments. It works only when your language supports lookbehind and lookahead assertions.
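A small sketch of the lookaround variant, again assuming Python, with made-up sample text and names. Python's re module supports lookbehind only for fixed-width patterns, which the literal <!--data--> satisfies.

```python
import re

# Hypothetical input with the data spanning two lines.
page2 = "x<!--data-->line one\nline two<!--data-->y"

# The lookbehind and lookahead assert the delimiters without consuming them,
# so the whole match (group 0) is already the bare text.
LOOKAROUND = re.compile(r"(?<=<!--data-->).+?(?=<!--data-->)", re.DOTALL)

m = LOOKAROUND.search(page2)
inner = m.group(0) if m else None
print(repr(inner))
```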
Solution 3:
Use an HTML parser. In your case this is probably the best solution, because HTML supports nested tags and it's not possible to reliably match those with regular expressions.
If you tell us the language you use, you may get a good answer using a parser available for that language.
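To give an idea of what the parser route could look like, here is a rough sketch using html.parser from the Python standard library. Python is an assumption, and the class name DataAndImages, its attributes and the sample markup are hypothetical. It toggles a flag on each <!--data--> comment, collects the text in between, and records the src of every img tag.

```python
from html.parser import HTMLParser


class DataAndImages(HTMLParser):
    """Collects text between <!--data--> comments and the src of img tags."""

    def __init__(self):
        super().__init__()
        self.inside = False      # True while between two <!--data--> comments
        self.chunks = []         # text pieces found between the comments
        self.image_links = []    # src values of img tags

    def handle_comment(self, data):
        # The parser passes the comment body without "<!--" and "-->".
        if data.strip() == "data":
            self.inside = not self.inside

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.image_links.append(src)


parser = DataAndImages()
# Hypothetical sample markup, only for illustration.
parser.feed('<div><!--data-->Hello <b>world</b><!--data--><img src="pic.png" alt=""></div>')
parser.close()

text = "".join(parser.chunks)
print(text)
print(parser.image_links)
```

Note that this sketch keeps only the text between the comments and drops any tags inside them. If you need the raw markup, the regex solutions above or a different parser may fit better.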