Regex to parse links containing specific words

2022-12-18 22:51

Taking this thread a step further, can someone tell me what the difference is between these two regular expressions? They both seem to accomplish the same thing: pulling a link out of HTML.

Expression 1:

'/(https?://)?(www.)?([a-zA-Z0-9_%]*)\b.[a-z]{2,4}(.[a-z]{2})?((/[a-zA-Z0-9_%])+)?(.[a-z])?/'

Expression 2:

'/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'

Which one would be better to use? And how could I modify one of those expressions to match only links that contain certain words, and to ignore any matches that do not contain those words?

Thanks.

The difference is that expression 1 looks for full, absolute URLs, roughly following the URI syntax, so it picks up every complete URL that appears anywhere in the code. That is not the same as getting all links: it misses relative URLs, which are used very often, and it matches every URL, not only the ones that are link targets.

The second looks for <a> tags and captures the content of the href attribute, so this one will get you every link. Except for one error* in that expression, it is quite safe to use and will work well enough to get you every link: it allows for the variations that can appear, such as whitespace or other attributes.

*However, there is one error in that expression: it does not look for the closing quote of the href attribute. You should add that, or you might match weird things:

/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?<\/a>/si
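For anyone who wants to try the corrected expression outside PHP, here is a minimal sketch in Python. It assumes Python's re module instead of preg_match_all, with the /si modifiers translated to re.S | re.I; the names LINK_RE, extract_hrefs and sample are made up for the illustration.

```python
import re

# Python translation of the corrected expression above.
# The PHP modifiers /si become the flags re.S (dot matches newline) and re.I.
LINK_RE = re.compile(
    r'''<a.*?href\s*=\s*["']([^"'>]+)["'][^>]*>.*?</a>''',
    re.S | re.I,
)

def extract_hrefs(html):
    """Return the captured href value of every link the pattern matches."""
    return LINK_RE.findall(html)

# Hypothetical input: one relative link, one with extra attributes,
# single quotes and whitespace around the equals sign.
sample = (
    '<p><a href="/about">About</a> and '
    "<a class=\"x\" href = 'http://example.com/'>Home</a></p>"
)

print(extract_hrefs(sample))  # ['/about', 'http://example.com/']
```

Because the closing quote is now required, a tag with an unterminated attribute such as <a href="a>text</a> no longer produces a match.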

Edit in response to the comment:

To look for word inside of the link URL, use:

/<a.*?href\s*=\s*["\']([^"\'>]*word[^"\'>]*)["\'][^>]*>.*?<\/a>/si

To look for word inside of the link text, use:

/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?word.*?<\/a>/si
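One thing to watch with the link-text variant: in dotall mode the .*? in front of word can run past one link's </a> and find the word in a later link, so the captured href may belong to the wrong link. A possible way around that, sketched here in Python rather than PHP, is to match each link first and do the word test in ordinary code. This is a different technique from the two patterns above, not a translation of them; the second capture group and the names links_containing, where and sample are assumptions made for the example.

```python
import re

# Corrected expression 2, with a second group added (for this sketch only)
# so the link text is captured along with the href.
LINK_RE = re.compile(
    r'''<a.*?href\s*=\s*["']([^"'>]+)["'][^>]*>(.*?)</a>''',
    re.S | re.I,
)

def links_containing(html, word, where="url"):
    """Return hrefs of links whose URL (where="url") or text (where="text")
    contains word, compared case-insensitively."""
    needle = word.lower()
    hits = []
    for href, text in LINK_RE.findall(html):
        haystack = href if where == "url" else text
        if needle in haystack.lower():
            hits.append(href)
    return hits

# Hypothetical input: "news" appears in the first URL and in the second text.
sample = (
    '<a href="/news/1">Weather</a>\n'
    '<a href="/blog/2">News digest</a>\n'
    '<a href="/blog/3">Recipes</a>'
)

print(links_containing(sample, "news", where="url"))   # ['/news/1']
print(links_containing(sample, "news", where="text"))  # ['/blog/2']
```

On the same input, the single-pattern link-text version would capture /news/1, because the search for the word continues from the first link into the text of the second.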

In the majority of cases I'd strongly recommend using an HTML parser (such as this one) to get these links. Using regular expressions to parse HTML is going to be problematic since HTML isn't regular and you'll have no end of edge cases to consider.
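As a rough illustration of the parser route, here is a sketch using html.parser from the Python standard library. That is only a stand-in, not the parser linked above, and LinkCollector, links_with_word and the sample markup are all invented for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

def links_with_word(html, word):
    """Return hrefs of links whose URL or text contains word (case-insensitive)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    needle = word.lower()
    return [href for href, text in parser.links
            if needle in href.lower() or needle in text.lower()]

# Hypothetical input with a named anchor, a commented-out link,
# and a tag that has a ">" inside an attribute value.
sample = (
    '<a name="foo">anchor</a>\n'
    '<!-- <a href="/news/hidden">commented out</a> -->\n'
    '<a data-href="a" title="b>c" href="/real/news">Story</a>\n'
    '<a href="/other">Latest News</a>\n'
    '<a href="/misc">Misc</a>'
)

print(links_with_word(sample, "news"))  # ['/real/news', '/other']
```

The commented-out link and the data-href attribute are ignored because the parser reports comments and attributes separately, which a single pattern cannot easily do.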

See here for more info.

/<a.*?href\s*=\s*["']([^"']+)[^>]*>.*?<\/a>/si

You have to be very careful with .*, even in its non-greedy form. The dot easily matches more than you bargained for, especially in dotall mode. For example:

<a name="foo">anchor</a>
<a href="...">...</a>

Here the expression matches from the start of the first <a to the end of the second </a>.
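This is easy to check. A small sketch, assuming Python's re module with /si translated to re.S | re.I (the names LINK_RE, html and match are made up for the example):

```python
import re

# The pattern quoted above, translated to Python.
LINK_RE = re.compile(
    r'''<a.*?href\s*=\s*["']([^"']+)[^>]*>.*?</a>''',
    re.S | re.I,
)

html = '<a name="foo">anchor</a>\n<a href="...">...</a>'
match = LINK_RE.search(html)

# The single match starts at the named anchor and ends after the second </a>.
print(match.group(0) == html)  # True
print(match.group(1))          # ...
```

The lazy .*? still has to grow until it reaches an href, and in dotall mode nothing stops it from crossing the newline and the first closing tag.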

Not to mention cases like:

<a href="a"></a >
<a href="b"></a>

or:

<a href="a'b>c">

or:

<a data-href="a" title="b>c" href="realhref">

or:

<!-- <a href="notreallyalink"> -->

and many many more fun edge cases. You can try to refine your regex to catch more possibilities, but you'll never get them all, because HTML cannot be parsed with regex (tell your friends)!
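To see what the pattern makes of those cases, here is a sketch in Python (re module, /si translated to re.S | re.I). A link text and closing </a> have been added where the snippets above show only the opening tag, since the pattern cannot match without one; that addition and the names hrefs and cases are assumptions for the illustration.

```python
import re

# The pattern quoted at the top of this answer, translated to Python.
LINK_RE = re.compile(
    r'''<a.*?href\s*=\s*["']([^"']+)[^>]*>.*?</a>''',
    re.S | re.I,
)

def hrefs(html):
    """Return every href value the pattern captures."""
    return LINK_RE.findall(html)

cases = {
    # Whitespace in the first closing tag: the second link is swallowed.
    "space in closing tag": '<a href="a"></a >\n<a href="b"></a>',
    # The quote inside the value cuts the capture short.
    "quote in value": '<a href="a\'b>c">x</a>',
    # data-href is found before the real href.
    "data-href first": '<a data-href="a" title="b>c" href="realhref">x</a>',
    # A commented-out link is reported as if it were live.
    "commented out": '<!-- <a href="notreallyalink">x</a> -->',
}

for name, html in cases.items():
    print(name, hrefs(html))
# space in closing tag ['a']
# quote in value ['a']
# data-href first ['a']
# commented out ['notreallyalink']
```

None of the four results is the list of real link targets, and each failure needs a different patch to the pattern.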

HTML+regex is a fool's game. Do yourself a favour. Use an HTML parser.

At a brief glance, the first one is rubbish but seems to be trying to match a link as text; the second one matches an HTML element.
