extract stylesheets via regex
Yes, I know, I know, parsing HTML with regular expressions is very bad. But I am working with legacy code that is supposed to extract all link and style elements from an HTML page. I would change it and use the DOM extension instead, but after the regex there is a huge code block which relies on the way preg_match_all returns the matched results.
The script is using this regex:
$pattern = '/<(link|style)(?=.+?(?:type="(text\/css)"|>))(?=.+?(?:media="(.*?)"|>))(?=.+?(?:href="(.*?)"|>))(?=.+?(?:rel="(.*?)"|>))[^>]+?\2[^>]+?(?:\/>|<\/style>)\s*/is';
preg_match_all($pattern, $htmlContent, $cssTags);
But it doesn't work. No elements are matched. Unfortunately I really suck at regex, so if someone could help me out it would be great.
I would break this problem into a few smaller ones. That is easier to write and easier to maintain, at the cost of a few more lines of code. The problem with one huge regex is that there are so many gotchas, and the input can be invalid, which is hard to manage in one big pattern.
/<link([^>]+)>/
-> extract attributes:
/([\w]+)\s*=\s*"([^"]*)"/
/<style[^>]*>(.+?)<\/style>/s
-> extract inline styles
And finally merge the results into an array as if preg_match_all produced it.
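The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the helper names extractAttributes and extractCssTags are made up, only double-quoted attribute values are handled, and the result layout (index 1 = tag, 2 = type, 3 = media, 4 = href, 5 = rel) is assumed to mirror the capture-group order of the pattern in the question.

```php
<?php
// Hypothetical helper: turns the attribute text of one tag into name => value.
function extractAttributes(string $attributeText): array
{
    $attributes = [];
    // Only double-quoted values are recognised, as in the pattern above.
    preg_match_all('/(\w+)\s*=\s*"([^"]*)"/', $attributeText, $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $attributes[strtolower($pair[1])] = $pair[2];
    }
    return $attributes;
}

// Hypothetical helper: builds an array shaped like a preg_match_all() result
// in the default PREG_PATTERN_ORDER (assumed group order: tag, type, media, href, rel).
function extractCssTags(string $html): array
{
    $cssTags = [[], [], [], [], [], []];

    // The style pattern captures the attribute text in group 1; group 2 holds the inline CSS.
    $patterns = [
        'link'  => '/<link([^>]+)>/i',
        'style' => '/<style([^>]*)>(.+?)<\/style>/is',
    ];

    foreach ($patterns as $tagName => $pattern) {
        preg_match_all($pattern, $html, $found, PREG_SET_ORDER);
        foreach ($found as $match) {
            $attributes   = extractAttributes($match[1]);
            $cssTags[0][] = $match[0];                  // full matched tag
            $cssTags[1][] = $tagName;
            $cssTags[2][] = $attributes['type'] ?? '';
            $cssTags[3][] = $attributes['media'] ?? '';
            $cssTags[4][] = $attributes['href'] ?? '';
            $cssTags[5][] = $attributes['rel'] ?? '';
        }
    }

    return $cssTags;
}
```

Note that this sketch lists all link tags before all style tags rather than in document order, which may or may not matter to the code that consumes the array.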
To grab the external resources only:
preg_match_all('#(<link\s(?:[^>]*rel="stylesheet")[^>]*>)\R?#is', $content, $matches, PREG_SET_ORDER);
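As a small usage sketch (the sample markup and the variable name $stylesheetTags are made up for illustration), this is how the matches could be collected. With PREG_SET_ORDER each entry of $matches is one match, and index 1 holds the tag without the trailing line break.

```php
<?php
// Made-up sample input for illustration.
$content = <<<HTML
<head>
<link rel="stylesheet" type="text/css" href="main.css">
<link rel="icon" href="favicon.ico">
<style type="text/css">p { margin: 0; }</style>
</head>
HTML;

preg_match_all('#(<link\s(?:[^>]*rel="stylesheet")[^>]*>)\R?#is', $content, $matches, PREG_SET_ORDER);

// Collect capture group 1 of every match: only the rel="stylesheet" link remains.
$stylesheetTags = array_column($matches, 1);
```

The icon link and the inline style element are skipped, so this only covers external stylesheets.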
If I were doing this with regular expressions, e.g. because I needed to handle invalid HTML, which is often difficult with a proper parser, I would use separate regular expressions. Use one or two regexes to get the style and link tags, and use another set of regexes to get the various attributes from each tag.
Your regex tries to do everything at once by using lookahead to scan the opening tag repeatedly to get all the elements. That's a neat trick in a situation where one regex is all you can use, but not something to be recommended when writing your own code.
I have made some improvements to your regex. I replaced the .*? and .+? with negated character classes where possible for efficiency. The reason your regex didn't work is that it doesn't correctly match the closing tag or handle link tags that have no closing tag. I fixed that.
The regex:
<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)
PHP:
$pattern = '%<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)%si';
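A quick way to check what the legacy code would receive is to run the pattern on a small sample. The sample markup below is made up. The pattern here uses the lazy [^<>]*? in every lookahead, including the one for rel, since a greedy [^<>]* there runs straight to the closing > and leaves group 5 empty.

```php
<?php
// Lookahead-based pattern; group order: 1 = tag, 2 = type, 3 = media, 4 = href, 5 = rel.
$pattern = '%<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)%si';

// Made-up sample input: one external stylesheet and one inline style block.
$htmlContent = '<link rel="stylesheet" type="text/css" media="screen" href="a.css" />' . "\n"
    . '<style type="text/css">body { color: red; }</style>';

// Default PREG_PATTERN_ORDER: $cssTags[n] lists capture group n for every match,
// with an empty string where an attribute was not present.
preg_match_all($pattern, $htmlContent, $cssTags);

print_r($cssTags);
```

For this input, $cssTags[1] holds the tag names (link, style) and $cssTags[4] holds the href values (a.css and an empty string for the style element).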
Thanks to all of you for your answers, but I finally rewrote that bit using the DOM extension. That should make it way more robust.
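For readers wondering what a DOM-based version might look like, here is a minimal sketch. It is not the code from the rewrite mentioned above; the function name extractCssNodes and the shape of the returned array are assumptions made for illustration, and it needs the dom extension to be enabled.

```php
<?php
// Hypothetical helper: collects link and style elements via the DOM extension.
function extractCssNodes(string $html): array
{
    $doc = new DOMDocument();

    // Invalid legacy markup makes libxml emit warnings; collect them internally instead.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    $xpath  = new DOMXPath($doc);

    // The XPath union returns the nodes in document order.
    foreach ($xpath->query('//link | //style') as $node) {
        $result[] = [
            'tag'     => $node->nodeName,
            'type'    => $node->getAttribute('type'),   // '' when the attribute is missing
            'media'   => $node->getAttribute('media'),
            'href'    => $node->getAttribute('href'),
            'rel'     => $node->getAttribute('rel'),
            'content' => $node->nodeName === 'style' ? $node->textContent : '',
        ];
    }

    return $result;
}
```

Unlike the regex versions, this does not care about attribute order or quoting style, but the result would still have to be mapped onto whatever structure the downstream code expects.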