Regular Expression to match a string only when certain characters don't exist

2022-12-12 05:57

So, here's my question:

I have a crawler that goes and downloads web pages and strips those of URLs (for future crawling). My crawler operates from a whitelist of URLs which are specified in regular expressions, so they're along the lines of:

(http://www.example.com/subdirectory/)(.*?)

...which would allow URLs that followed the pattern to be crawled in the future. The problem I'm having is that I'd like to exclude certain characters in URLs, so that (for example) addresses such as:

(http://www.example.com/subdirectory/)(somepage?param=1&param=5#print)

...in the case above, as an example, I'd like to be able to exclude URLs that feature ?, #, and = (to avoid crawling those pages). I've tried quite a few different approaches, but I can't seem to get it right:

(http://www.example.com/)([^=\?#](.*?))

etc. Any help would be really appreciated!

EDIT: sorry, should've mentioned this is written in Python, and I'm normally fairly proficient at regex (although this has me stumped)

EDIT 2: VoDurden's answer (the accepted one below) almost yields the correct result, all it needs is the $ character at the end of the expression and it works perfectly - example:

(http://www.example.com/)([^=\?#]*)$

(http://www.example.com/)([^=?#]*?)

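Should do it, provided the expression is anchored at the end with $ (as written, the lazy group can match the empty string, so every URL under the prefix passes). Once anchored, it allows only URLs that contain none of the characters you don't want.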
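
To see why the end anchor matters, here is a minimal Python 3 sketch. The two patterns are the ones quoted in this thread (dots left unescaped, as in the original); the variable names are only illustrative.

```python
import re

# Patterns as they appear in the thread.
UNANCHORED = re.compile(r"(http://www.example.com/)([^=?#]*?)")
ANCHORED = re.compile(r"(http://www.example.com/)([^=?#]*)$")

url = "http://www.example.com/subdirectory/index.php?param=1"

# The lazy group is allowed to match nothing, so this succeeds anyway.
loose = UNANCHORED.match(url)
print(loose is not None, repr(loose.group(2)))  # True ''

# With `$`, the character class has to cover the rest of the URL,
# so the "?" makes the match fail.
strict = ANCHORED.match(url)
print(strict is not None)  # False
```

In other words, without `$` the second group never gets a chance to reject anything.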

It might however be a little bit hard to extend this approach. A better option is to have the system work in two tiers, i.e. one set of matching regexes and one set of blocking regexes. Only URLs that match the first set and match none of the second are then allowed. I think this solution will be a bit more transparent and flexible.
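
A rough sketch of the two-tier idea in Python 3 could look like the following. The ALLOW and BLOCK lists and the is_crawlable helper are hypothetical names made up for illustration; a real crawler would load its own rule sets.

```python
import re

# Hypothetical rule sets: a URL must match at least one allow pattern
# and must not match any block pattern.
ALLOW = [re.compile(r"http://www\.example\.com/subdirectory/")]
BLOCK = [re.compile(r"[=?#]")]


def is_crawlable(url):
    """Return True if the URL passes the allow tier and the block tier."""
    allowed = any(p.match(url) for p in ALLOW)    # anchored at the start
    blocked = any(p.search(url) for p in BLOCK)   # anywhere in the URL
    return allowed and not blocked


print(is_crawlable("http://www.example.com/subdirectory/index.php"))          # True
print(is_crawlable("http://www.example.com/subdirectory/index.php?param=1"))  # False
```

Keeping the block patterns separate means a new exclusion (say, a file extension) can be added without touching any whitelist entry.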

This expression should be what you're looking for:

(http://www.example.com/subdirectory/)([^=?#]*)$

[^=?#] will match any character except the ones you specified (inside a character class, ? does not need a backslash).

For example:

http://www.example.com/subdirectory/ Match
http://www.example.com/subdirectory/index.php Match
http://www.example.com/subdirectory/somepage?param=1&param=5#print No Match
http://www.example.com/subdirectory/index.php?param=1 No Match
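
The table above can be reproduced with a short Python 3 check. This is a sketch under two assumptions not in the original answer: the dots in the host name are escaped, and re.fullmatch (Python 3.4+) is used in place of the trailing $.

```python
import re

# Same idea as the expression above; fullmatch() requires the whole URL
# to be consumed, which plays the role of the trailing `$`.
WHITELIST = re.compile(r"(http://www\.example\.com/subdirectory/)([^=?#]*)")

urls = [
    "http://www.example.com/subdirectory/",
    "http://www.example.com/subdirectory/index.php",
    "http://www.example.com/subdirectory/somepage?param=1&param=5#print",
    "http://www.example.com/subdirectory/index.php?param=1",
]

for url in urls:
    verdict = "Match" if WHITELIST.fullmatch(url) else "No Match"
    print(url, verdict)
```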

You will need to crawl the pages up to and including the query string, ?param=1&param=5, because param=1 and param=2 will normally give you completely different web pages. Pick any WordPress website to confirm that.

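Try this one. It matches everything up to, but not including, the # character (note the greedy *; a lazy *? with nothing after it would match the empty string):

(http://www.example.com/)([^#]*)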
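
A small Python 3 sketch of the "match up to #" idea follows. It uses a greedy group, because a lazy one with nothing after it would match the empty string; the strip_fragment helper is a hypothetical name used only for illustration.

```python
import re

# Greedy group: takes everything up to the first "#" (or the end of the URL).
BEFORE_FRAGMENT = re.compile(r"(http://www\.example\.com/)([^#]*)")


def strip_fragment(url):
    """Return the URL without its #fragment, or None if the prefix does not match."""
    m = BEFORE_FRAGMENT.match(url)
    return m.group(0) if m else None


print(strip_fragment("http://www.example.com/subdirectory/somepage?param=1&param=5#print"))
# http://www.example.com/subdirectory/somepage?param=1&param=5
```

This keeps the query string, so pages that differ only by parameters are still crawled separately.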

I'm not sure what you want. If you want to match anything that doesn't contain any ?, #, or =, then the regex is

([^=?#]*)

As an alternative, there's always the urlparse module (urllib.parse in Python 3), which is designed for parsing URLs.

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

urls = [
    'http://www.example.com/subdirectory/',
    'http://www.example.com/subdirectory/index.php',
    'http://www.example.com/subdirectory/somepage?param=1&param=5#print',
    'http://www.example.com/subdirectory/index.php?param=1',
]

for url in urls:
    parts = urlparse(url)
    # keep only URLs that have neither a query string nor a #fragment
    if not parts.query and not parts.fragment:
        print(url)

Provides the following:

http://www.example.com/subdirectory/
http://www.example.com/subdirectory/index.php
