Regex to match specific HTML tags

I need to match HTML tags (the whole tag) based on the tag name.

For script tags I have this:

<script.+src=.+(\.js|\.axd).+(</script>|>)

It correctly matches both tags in the following html:

<script src="Scripts/JScript1.js" type="text/javascript" />
<script type="text/javascript" src="Scripts/JScript2.js" />

However, when I do link tags with the following:

<link.+href=.+(\.css).+(</link>|>)

It matches all of this at once (i.e. it returns a single match containing both items):

<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />

What am I missing here? The two regexes are essentially identical except for the literal text they match.

Also, I know that regex is not a great tool for HTML parsing... I will probably end up using the HtmlAgilityPack, but this is driving me nuts and I want an answer if only for my own mental health!


The .+ wildcards are greedy: each one matches as many characters as it can (any character except a newline, by default). This:

<link.+href=.+(\.css).+(</link>|>)

Likely matches like this:

<link      => <link
.+         =>  href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
              <link 
href=      => href=
.+         => "Stylesheets/StyleSheet2
\.css      => .css
.+         => " rel="Stylesheet" type="text/css" /
</link>|>  => >
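
To see the greedy behaviour for yourself, here is a minimal sketch. It uses Python's re module rather than .NET (my assumption, purely for illustration; the greedy quantifiers behave the same way for this pattern), and it assumes both tags sit on one line, because . does not match a newline by default.

```python
import re

# Assumption: both tags are on a single line, since "." stops at newlines by default.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# The original pattern from the question, with greedy .+ wildcards.
greedy = re.compile(r'<link.+href=.+(\.css).+(</link>|>)')

greedy_matches = [m.group(0) for m in greedy.finditer(html)]

# The first .+ swallows the whole first tag, so only one match comes back.
print(len(greedy_matches))
print(greedy_matches[0] == html)
```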

Instead, consider using [^>]+ in place of .+, since it cannot match past the > that closes the current tag. Also, do you really care about the closing tag?

<link[^>]+href=[^>]+(\.css)[^>]+>
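
A quick way to check this, sketched with Python's re module as a stand-in for .NET (an assumption on my part; the negated character class works the same way in both), using the two link tags from the question placed on one line:

```python
import re

# Same two tags as in the question, joined on one line for the test.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# [^>]+ can never consume a ">", so each match stays inside a single tag.
bounded = re.compile(r'<link[^>]+href=[^>]+(\.css)[^>]+>')

bounded_matches = [m.group(0) for m in bounded.finditer(html)]

for tag in bounded_matches:
    print(tag)
```

Each tag should come back as its own match, because the pattern has no way to step over the > that ends the first tag.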


The problem is that your regex is greedy. Every .+ matches as many characters as it can, so it runs on to the last possible match. You need to make each one non-greedy (lazy) by appending a ?, which makes it match as few characters as possible to satisfy the pattern instead of going on to the next matching string.

Change the pattern to this: "<link.+?href=.+?(\.css).+?(</link>|>)"

Then you'll need to use Regex.Matches to get multiple matches and loop over them.
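
Here is a rough sketch of that loop. It uses Python's re.finditer in place of .NET's Regex.Matches (my assumption, for illustration only; lazy quantifiers behave the same way for this pattern). The second input, with favicon.ico and site.css, is a made-up example and not from the question.

```python
import re

lazy = re.compile(r'<link.+?href=.+?(\.css).+?(</link>|>)')

# The two tags from the question, on one line.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" /> '
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# Loop over every match, the way you would over Regex.Matches in .NET.
lazy_matches = [m.group(0) for m in lazy.finditer(html)]
for tag in lazy_matches:
    print(tag)

# Hypothetical input: the first link tag has no ".css" in it.
mixed = '<link rel="icon" href="favicon.ico" /> <link href="site.css" rel="stylesheet" />'

# Lazy wildcards still expand until the pattern can succeed, so this match
# starts at the first tag and runs through to the end of the second one.
mixed_matches = [m.group(0) for m in lazy.finditer(mixed)]
print(len(mixed_matches))
```

So the lazy version fixes the input from the question, but a lazy .+? can still cross a tag boundary when the first tag does not contain .css. A pattern built on [^>]+ does not have that problem.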
