extract stylesheets via regex

2023-01-05 02:58

Yes, I know, I know, parsing HTML with regular expressions is very bad. But I am working with legacy code that is supposed to extract all link and style elements from an HTML page. I would change it and use the DOM extension instead, but after the regex there is a huge code block that relies on the way preg_match_all returns the matched results.

The script is using this regex:

$pattern = '/<(link|style)(?=.+?(?:type="(text\/css)"|>))(?=.+?(?:media="(.*?)"|>))(?=.+?(?:href="(.*?)"|>))(?=.+?(?:rel="(.*?)"|>))[^>]+?\2[^>]+?(?:\/>|<\/style>)\s*/is';

preg_match_all($pattern, $htmlContent, $cssTags);

But it doesn't work. No elements are matched. Unfortunately I really suck at regex, so if someone could help me out it would be great.

I would break this problem into a few smaller ones. That is easier to write and easier to maintain, at the cost of a few more lines of code. The problem with one huge regex is that there are so many gotchas, and the input can be invalid, which is hard to manage in one big pattern.

/<link([^>]+)>/
-> extract attributes:
   /([\w]+)\s*=\s*"([^"]*)"/

/<style[^>]*>(.+?)<\/style>/s
-> extract inline styles

And finally merge the results into an array as if preg_match_all produced it.
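
The steps above can be sketched roughly as follows. This is only a minimal illustration: the function names parseAttributes and extractCssTags are hypothetical, and the group layout (0 = full tag, 1 = tag name, 2 = type, 3 = media, 4 = href, 5 = rel) is an assumption taken from the capture groups of the pattern in the question.

```php
<?php
// Hypothetical helper names; they do not come from the legacy script.

// Turn the attribute text of an opening tag into a lowercase name => value map.
// Only double-quoted values are recognised.
function parseAttributes(string $attributeText): array
{
    preg_match_all('/([\w-]+)\s*=\s*"([^"]*)"/', $attributeText, $pairs, PREG_SET_ORDER);
    $attributes = [];
    foreach ($pairs as $pair) {
        $attributes[strtolower($pair[1])] = $pair[2];
    }
    return $attributes;
}

// Build an array laid out like a default preg_match_all() result:
// [0] full tag, [1] tag name, [2] type, [3] media, [4] href, [5] rel.
function extractCssTags(string $html): array
{
    $found = [];

    // Step 1: <link ...> tags.
    preg_match_all('/<link\b([^>]*)>/i', $html, $links, PREG_SET_ORDER);
    foreach ($links as $link) {
        $found[] = [$link[0], 'link', parseAttributes($link[1])];
    }

    // Step 2: <style ...>...</style> blocks.
    preg_match_all('/<style\b([^>]*)>.*?<\/style>/is', $html, $styles, PREG_SET_ORDER);
    foreach ($styles as $style) {
        $found[] = [$style[0], 'style', parseAttributes($style[1])];
    }

    // Step 3: merge into parallel lists, using '' for missing attributes.
    $result = [[], [], [], [], [], []];
    foreach ($found as [$source, $tagName, $attributes]) {
        $result[0][] = $source;
        $result[1][] = $tagName;
        $result[2][] = $attributes['type'] ?? '';
        $result[3][] = $attributes['media'] ?? '';
        $result[4][] = $attributes['href'] ?? '';
        $result[5][] = $attributes['rel'] ?? '';
    }
    return $result;
}
```

Note that this sketch lists all link tags before all style blocks rather than in document order, and it records any type value, not just text/css. Code that depends on the exact output of the old pattern would need to be checked against both differences.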

To grab the external resources only:

preg_match_all('#(<link\s(?:[^>]*rel="stylesheet")[^>]*>)\R?#is', $content, $matches, PREG_SET_ORDER);
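
For illustration, here is how that call might be used on a small sample. The sample markup and the variable $stylesheetTags are made up for this sketch. With PREG_SET_ORDER each entry of $matches is one match, where index 0 is the full match (including any trailing line break consumed by \R?) and index 1 is the tag itself.

```php
<?php
// Sample input, invented for this sketch.
$content = <<<'HTML'
<link rel="stylesheet" href="main.css" media="screen">
<link rel="icon" href="favicon.ico">
<style type="text/css">body { margin: 0; }</style>
HTML;

preg_match_all('#(<link\s(?:[^>]*rel="stylesheet")[^>]*>)\R?#is', $content, $matches, PREG_SET_ORDER);

// Keep only capture group 1 of every match: the tag without the line break.
$stylesheetTags = array_column($matches, 1);

foreach ($stylesheetTags as $tag) {
    echo $tag, PHP_EOL;
}
```

Only the first link tag is picked up here; the icon link and the style block are skipped because neither contains rel="stylesheet" inside a link tag.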

If I were doing this with regular expressions, e.g. because I needed to handle invalid HTML, which is often difficult with a proper parser, I would use separate regular expressions. Use one or two regexes to get the style and link tags, and another set of regexes to get the various attributes from each tag.

Your regex tries to do everything at once by using lookahead to scan the opening tag repeatedly to get all the attributes. That's a neat trick in a situation where one regex is all you can use, but not something to be recommended when writing your own code.

I have made some improvements to your regex. I replaced the .*? and .+? with negated character classes where possible for efficiency. The reason why your regex didn't work is that it doesn't correctly try to match the closing tag or correctly handle link tags that have no closing tag. I fixed that.

The regex:

<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)

PHP:

$pattern = '%<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)%si';
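
As a rough check, the pattern can be run against a small sample like the one below. The sample markup is invented for this sketch. Note that every lookahead here uses the lazy [^<>]*?; with a greedy [^<>]* in the rel lookahead, the scan would run to the closing > first and group 5 would stay empty.

```php
<?php
// Sample input, invented for this sketch.
$htmlContent = <<<'HTML'
<link rel="stylesheet" type="text/css" href="main.css" media="screen" />
<style type="text/css" media="print">body { margin: 0; }</style>
HTML;

$pattern = '%<(link|style)(?=[^<>]*?(?:type="(text/css)"|>))(?=[^<>]*?(?:media="([^<>"]*)"|>))(?=[^<>]*?(?:href="(.*?)"|>))(?=[^<>]*?(?:rel="([^<>"]*)"|>))(?:.*?</\1>|[^<>]*>)%si';

preg_match_all($pattern, $htmlContent, $cssTags);

// Default (PREG_PATTERN_ORDER) layout: one list per capture group.
// [1] tag name, [2] type, [3] media, [4] href, [5] rel
foreach ($cssTags[1] as $i => $tagName) {
    printf(
        "%s: type=%s media=%s href=%s rel=%s\n",
        $tagName,
        $cssTags[2][$i],
        $cssTags[3][$i],
        $cssTags[4][$i],
        $cssTags[5][$i]
    );
}
```

Attributes that a tag does not have, such as href and rel on the style element, come back as empty strings in this layout.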

Thanks to all for your answers, but I finally rewrote that bit using the DOM extension. That should make it way more robust.
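
The rewritten code is not shown in this thread, so the following is only a guess at what a DOM-based version could look like. The function name extractCssElements and the shape of the returned array are assumptions made for this sketch.

```php
<?php
// Hypothetical function; the actual DOM-based rewrite is not shown in the thread.
function extractCssElements(string $html): array
{
    $document = new DOMDocument();

    // Collect parser warnings internally, since legacy pages often contain invalid markup.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $elements = [];
    $xpath = new DOMXPath($document);

    // The union query returns link and style elements in document order.
    foreach ($xpath->query('//link | //style') as $node) {
        $elements[] = [
            'tag'     => $node->nodeName,
            'type'    => $node->getAttribute('type'),   // '' when the attribute is missing
            'media'   => $node->getAttribute('media'),
            'href'    => $node->getAttribute('href'),
            'rel'     => $node->getAttribute('rel'),
            'content' => $node->nodeName === 'style' ? $node->textContent : '',
        ];
    }
    return $elements;
}
```

Unlike the regex versions, this does not care about attribute order or quoting style, because the parser normalises both before the attributes are read.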
