Regex: Match URLs for specific domain EXCEPT when a certain querystring parameter has a certain value

In short, I need to match all URLs in a block of text that belong to a certain domain and do not contain a specific querystring parameter and value (refer=twitter).

I have the following regex to match all URLs for the domain.

\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?

I just can't get the last part, the exclusion, to work. My attempt:

(?![&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com(/[^\s]*)?

So the following SHOULD match

example.com
http://example.com/
https://www.example.com#link
www.example.com?somevalue=foo

But these should NOT

https://www.anotherexample.com#link
www.example.com?refer=twitter

EDIT: And if you can get it to match the

http://example.com?foo=foo.bar 

out of a sentence like

For examples go to http://example.com?foo=foo.bar.

without picking up the period, that would be great!

EDIT2: Fixed the trailing period issue with this

\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?

EDIT3: This seems to work, or at least 99% of the tests I've thrown at it

(?!\b.*[&?]refer=twitter)\b(https?://)?([a-z0-9-]+\.)*example\.com/?([^\s]*[^.])?

EDIT4: Settled on

\b(?!.*[&?]refer=twitter)(https?://)?([a-z0-9-]+\.)*example\.com(?!\.)[^\s]*\b+


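(?!\b.*[&?]refer=twitter)

is what you're looking for. Placed in front of the URL pattern, this negative lookahead rejects a match whenever refer=twitter appears after a ? or & further along.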
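One thing worth checking is how far the lookahead reaches. The sketch below is only an illustration in Python: `TAIL`, `LINE_WIDE`, `TOKEN_SCOPED`, and `find_urls` are names made up here, and `TAIL` is an assumed simplification of the question's URL pattern. It compares a lookahead that uses `.*` (which scans to the end of the line) with one that uses `\S*` (which stops at the end of the current URL).

```python
import re

# Assumed URL pattern for this sketch: the domain part from the question,
# followed by an optional path, query string, or fragment.
TAIL = r"(?:https?://)?(?:[a-z0-9-]+\.)*example\.com(?:[/?#]\S*)?"

# Same idea as the lookahead above: `.*` scans to the end of the line.
LINE_WIDE = re.compile(r"\b(?!.*[&?]refer=twitter)" + TAIL)
# Variant: `\S*` stops at the next whitespace, i.e. at the end of this URL.
TOKEN_SCOPED = re.compile(r"\b(?!\S*[&?]refer=twitter)" + TAIL)

def find_urls(pattern, text):
    # Drop a sentence-ending period that the greedy \S* may have swallowed.
    return [m.group(0).rstrip(".") for m in pattern.finditer(text)]

text = "Visit example.com or www.example.com?refer=twitter today"
print(find_urls(LINE_WIDE, text))     # []
print(find_urls(TOKEN_SCOPED, text))  # ['example.com']
```

With `.*`, the unwanted URL later on the line also blocks the earlier, wanted one. If several URLs can share a line, the `\S*` form is probably closer to what the question asks for.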


To be honest, using a regex didn't even cross my mind at first (which is a good sign: IMO a regex should always be a secondary option, not the primary one). Here is how I'd do it in my language of choice, Python:

>>> from urllib.parse import urlparse, parse_qs  # Python 3; in Python 2 the module is urlparse
>>> p = urlparse('http://foo.bar.com/baz?refer=twitter&rock=paper')
>>> parse_qs(p.query)
{'refer': ['twitter'], 'rock': ['paper']}

You can do anything from here.
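As a rough sketch of where this could go, the function below splits the text on whitespace, parses each token as a URL, and keeps those on the wanted domain that lack refer=twitter. The name `wanted_urls`, its parameters, and the set of trailing punctuation stripped are assumptions made for this example, not part of the original answer.

```python
from urllib.parse import urlsplit, parse_qs

def wanted_urls(text, domain="example.com", param="refer", value="twitter"):
    """Return tokens in `text` that are URLs on `domain` without param=value."""
    found = []
    for token in text.split():
        # Drop sentence punctuation stuck to the end of the token.
        candidate = token.rstrip(".,;:!?")
        # urlsplit only recognises the host when a scheme is present.
        with_scheme = candidate if "://" in candidate else "http://" + candidate
        try:
            parts = urlsplit(with_scheme)
            host = (parts.hostname or "").lower()
        except ValueError:
            # Tokens such as "[see" are rejected by urlsplit; skip them.
            continue
        # Accept the bare domain or any subdomain, but not "anotherexample.com".
        if host != domain and not host.endswith("." + domain):
            continue
        # parse_qs maps each parameter name to a list of its values.
        if value in parse_qs(parts.query).get(param, []):
            continue
        found.append(candidate)
    return found

sample = ("For examples go to http://example.com?foo=foo.bar. "
          "Skip www.example.com?refer=twitter and "
          "https://www.anotherexample.com#link but keep "
          "https://www.example.com#link")
print(wanted_urls(sample))
# ['http://example.com?foo=foo.bar', 'https://www.example.com#link']
```

Comparing the parsed host and query values avoids the regex edge cases (trailing periods, lookalike domains, partial parameter matches) at the cost of a cruder tokenizer.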
