
extract stylesheets via regex

Yes, I know, I know, parsing HTML with regular expressions is very bad. But I am working with legacy code that is supposed to extract all link and style elements from an HTML page. I would change it and use the DOM extension instead, but after the regex there is a huge code block which relies on the way preg_match_all returns the matched results.

The script is using this regex:

$pattern = '/<(link|style)(?=.+?(?:type="(text\/css)"|>))(?=.+?(?:media="(.*?)"|>))(?=.+?(?:href="(.*?)"|>))(?=.+?(?:rel="(.*?)"|>))[^>]+?\2[^>]+?(?:\/>|<\/style>)\s*/is';

preg_match_all($pattern, $htmlContent, $cssTags);

But it doesn't work. No elements are matched. Unfortunately I really suck at regex, so if someone could help me out, that would be great.


I would break this problem into a few smaller ones. That would be easier to write and easier to maintain, at the cost of a few more lines of code. The problem with one huge regex is that there are so many gotchas, and the input can be invalid, which is hard to handle in one big pattern.

/<link([^>]+)>/
-> extract attributes:
   /([\w]+)\s*=\s*"([^"]*)"/

/<style[^>]*>(.+?)<\/style>/is
-> extract inline styles

And finally merge the results into an array as if preg_match_all produced it.
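
The steps above could be sketched roughly as follows. This is only an illustration: the helper names extractAttributes and extractCssTags are made up, only double-quoted attribute values are handled, and the result mimics the default (PREG_PATTERN_ORDER) layout of the legacy pattern, with index 0 holding the full tags, then tag name, type, media, href and rel. Unlike a single preg_match_all call, all link tags come before all style tags rather than in document order.

```php
<?php
// Hypothetical helper: turn the inside of a tag into a name => value map.
function extractAttributes(string $attributeString): array
{
    $attributes = [];
    preg_match_all('/([\w-]+)\s*=\s*"([^"]*)"/', $attributeString, $pairs, PREG_SET_ORDER);
    foreach ($pairs as $pair) {
        $attributes[strtolower($pair[1])] = $pair[2];
    }
    return $attributes;
}

// Hypothetical helper: build an array shaped like the legacy $cssTags result.
function extractCssTags(string $html): array
{
    // 0: full tag, 1: tag name, 2: type, 3: media, 4: href, 5: rel
    $result = [[], [], [], [], [], []];
    $rows = [];

    // Step 1: every <link ...> tag and its attributes.
    preg_match_all('/<link\b([^>]*)>/i', $html, $links, PREG_SET_ORDER);
    foreach ($links as $link) {
        $a = extractAttributes($link[1]);
        $rows[] = [$link[0], 'link', $a['type'] ?? '', $a['media'] ?? '', $a['href'] ?? '', $a['rel'] ?? ''];
    }

    // Step 2: every <style>...</style> element (no href or rel there).
    preg_match_all('/<style\b([^>]*)>.*?<\/style>/is', $html, $styles, PREG_SET_ORDER);
    foreach ($styles as $style) {
        $a = extractAttributes($style[1]);
        $rows[] = [$style[0], 'style', $a['type'] ?? '', $a['media'] ?? '', '', ''];
    }

    // Step 3: merge the rows into per-group columns.
    foreach ($rows as $row) {
        foreach ($row as $index => $value) {
            $result[$index][] = $value;
        }
    }
    return $result;
}
```

Calling extractCssTags($htmlContent) in place of the original preg_match_all call should let the downstream code keep reading $cssTags[4] for hrefs and so on, as long as it does not depend on document order.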


To grab the external resources only:

preg_match_all('#(<link\s(?:[^>]*rel="stylesheet")[^>]*>)\R?#is', $content, $matches, PREG_SET_ORDER);
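
As a quick illustration of what that call returns, here is a small sketch run against made-up sample markup (the file names are placeholders). With PREG_SET_ORDER each entry of $matches is one match, and index 1 of that entry is the captured link tag.

```php
<?php
// Sample input, invented for this example.
$content = implode("\n", [
    '<link rel="stylesheet" href="main.css">',
    '<link rel="icon" href="favicon.ico">',
    '<style>body { margin: 0; }</style>',
    '<link href="print.css" rel="stylesheet" media="print" />',
]);

preg_match_all('#(<link\s(?:[^>]*rel="stylesheet")[^>]*>)\R?#is', $content, $matches, PREG_SET_ORDER);

// Collect just the captured tags; the icon link and the style block are skipped.
$tags = [];
foreach ($matches as $match) {
    $tags[] = $match[1];
}
```

Because [^>]* cannot cross a closing angle bracket, the rel="stylesheet" check stays inside a single tag, so the icon link is not picked up by accident.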


If I were doing this with regular expressions, for example because I needed to handle invalid HTML, which is often difficult with a proper parser, I would use separate regular expressions. Use one or two regexes to get the style and link tags, and another set of regexes to get the various attributes from each tag.

Your regex tries to do everything at once by using lookahead to scan the opening tag repeatedly to get all the elements. That's a neat trick in a situation where one regex is all you can use, but not something to be recommended when writing your own code.

I have made some improvements to your regex. I replaced the .*? and .+? with negated character classes where possible for efficiency. The reason why your regex didn't work is that it doesn't correctly try to match the closing tag or correctly handle link tags that have no closing tag. I fixed that.

The regex:

<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)

PHP:

$pattern = '%<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)%si';
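
A small sketch of the pattern in use, on made-up sample markup. One detail matters here: every lookahead, including the one for rel, needs the lazy [^<>]*? quantifier. With a greedy [^<>]* the lookahead runs straight to the closing > and takes that alternative, so the rel attribute is never captured.

```php
<?php
$pattern = '%<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)%si';

// Sample input, invented for this example.
$htmlContent = '<link rel="stylesheet" type="text/css" href="a.css" media="screen" />' . "\n"
             . '<style type="text/css" media="print">body { color: red; }</style>';

preg_match_all($pattern, $htmlContent, $cssTags);

// Default PREG_PATTERN_ORDER layout:
// $cssTags[0] full tags, [1] tag name, [2] type, [3] media, [4] href, [5] rel.
// Attributes that are absent (href and rel on the style tag) come back as ''.
foreach ($cssTags[1] as $i => $tagName) {
    printf("%s: media=%s href=%s rel=%s\n", $tagName, $cssTags[3][$i], $cssTags[4][$i], $cssTags[5][$i]);
}
```

Note that the link alternative first scans ahead for a closing </link> before falling back to the end of the opening tag, so a stray </link> later in the page would extend the match.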


Thanks, all, for your answers, but I finally rewrote that bit using the DOM extension. That should make it way more robust.
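
For anyone landing here later, a DOM-based version might look something like the sketch below. It is not the actual rewrite mentioned above; the function name extractCssNodes and the shape of the returned array are invented for illustration, and it only keeps link elements whose rel is exactly "stylesheet".

```php
<?php
// Hypothetical helper: collect stylesheet links and inline style blocks via DOMDocument.
function extractCssNodes(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally so invalid markup does not emit PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];

    foreach ($doc->getElementsByTagName('link') as $link) {
        if (strtolower($link->getAttribute('rel')) !== 'stylesheet') {
            continue; // skip icons, feeds and other link types
        }
        $found[] = [
            'tag'   => 'link',
            'type'  => $link->getAttribute('type'),
            'media' => $link->getAttribute('media'),
            'href'  => $link->getAttribute('href'),
        ];
    }

    foreach ($doc->getElementsByTagName('style') as $style) {
        $found[] = [
            'tag'   => 'style',
            'type'  => $style->getAttribute('type'),
            'media' => $style->getAttribute('media'),
            'css'   => $style->textContent, // the inline CSS text
        ];
    }

    return $found;
}
```

getAttribute returns an empty string for a missing attribute, so no isset checks are needed on the caller's side.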
