Regex to match specific HTML tags
I need to match HTML tags (the whole tag), based on the tag name.
For script tags I have this:
<script.+src=.+(\.js|\.axd).+(</script>|>)
It correctly matches both tags in the following html:
<script src="Scripts/JScript1.js" type="text/javascript" />
<script type="text/javascript" src="Scripts/JScript2.js" />
However, when I try to match link tags with the following:
<link.+href=.+(\.css).+(</link>|>)
It matches all of this at once (i.e. it returns one match containing both items):
<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />
What am I missing here? The regexes are essentially identical except for the text they match.
Also, I know that regex is not a great tool for HTML parsing...I will probably end up using the HtmlAgilityPack in the end, but this is driving me nuts and I want an answer if only for my own mental health!
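The .+ wildcards are greedy and match anything, including > and the start of the next tag (by default . stops only at a newline, so this applies when both tags are on one line or the Singleline option is set). This: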
<link.+href=.+(\.css).+(</link>|>)
Likely matches like this:
<link => <link
.+ => href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />
<link
href= => href=
.+ => "Stylesheets/StyleSheet2
\.css => .css
.+ => " rel="Stylesheet" type="text/css" /
</link>|> => >
Instead consider using [^>]+ in place of .+. Also, do you really care about the closing tag?
<link[^>]+href=[^>]+(\.css)[^>]+>
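To see the difference, here is a minimal sketch using Python's re module rather than .NET (an assumption made for illustration; these two patterns use only syntax that both engines handle the same way). The sample string puts both tags on one line, which is also an assumption, since . does not cross newlines by default.

```python
import re

# Assumed sample input: both tags on a single line, so "." can run
# from the first tag into the second.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />'
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# Original pattern: every ".+" is free to cross a ">".
greedy = re.compile(r'<link.+href=.+(\.css).+(</link>|>)')

# Suggested pattern: "[^>]+" cannot cross a ">", so a match stays inside one tag.
bounded = re.compile(r'<link[^>]+href=[^>]+(\.css)[^>]+>')

greedy_matches = [m.group(0) for m in greedy.finditer(html)]
bounded_matches = [m.group(0) for m in bounded.finditer(html)]

print(len(greedy_matches))   # 1 (one match covering both tags)
print(len(bounded_matches))  # 2 (one match per tag)
```

Because [^>] can never consume the > that closes a tag, each match has to end at the first > after its own .css.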
The problem is that your regex is greedy. Every .+ in it is greedy; you need to make each one non-greedy by appending a ? to it, which makes it match as few characters as are needed to satisfy the pattern instead of running on to the next matching string.
Change the pattern to this: "<link.+?href=.+?(\.css).+?(</link>|>)"
Then you'll need to use Regex.Matches to get multiple matches and loop over them.
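A minimal sketch of the non-greedy pattern and the loop over its matches, written in Python for illustration rather than C# (an assumption; re.finditer plays the role that Regex.Matches plays in .NET). The one-line sample input is also an assumption.

```python
import re

# Assumed sample input: both tags on a single line.
html = (
    '<link href="Stylesheets/StyleSheet1.css" rel="Stylesheet" type="text/css" />'
    '<link href="Stylesheets/StyleSheet2.css" rel="Stylesheet" type="text/css" />'
)

# Each ".+?" is lazy: it grows one character at a time and stops as soon
# as the rest of the pattern can match.
lazy = re.compile(r'<link.+?href=.+?(\.css).+?(</link>|>)')

tags = []
for match in lazy.finditer(html):
    # group(0) is the whole matched tag.
    tags.append(match.group(0))

print(len(tags))  # 2
```

One caveat: laziness alone does not stop at a tag boundary. If a link tag has no href, .+? will still run into the following tag to find one, which [^>]+ would not do.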