Regex to find some .php files

I'm trying to write an exclusion regex for a crawler. I want to index all the .php files that appear under the /archives/ directory, and none anywhere else. Because this is an exclusion pattern, the regex should match every .php file except those inside an /archives/ directory (however deeply nested). So, for example, the crawler will index

www.mysite.com/archives/123qwe/index.php 

but not

www.mysite.com/123qwe/index.php

I believe this regex should work: (?<!\/archives\/.*)\.php$

However, I'm not able to use the < character, because I need to submit the regex into a web form that sanitizes <'s from the input. And using &lt; breaks the regex. So is there another way to form this regex, without needing the <?


What about

(?!.*\/magazine\/)(?:^.*\.php$)

This is a negative lookahead instead of your negative lookbehind. This regex should match if there is no /magazine/ in the string (with /magazine/ standing in for your /archives/) and the string ends with .php

That's very similar to your approach, but without the <.
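
To check the lookahead idea outside the web form, here is a minimal Python sketch. It assumes /archives/ is substituted for /magazine/ to match the question, and the helper name is_excluded is made up for illustration. The crawler's own regex engine may behave differently from Python's re module.

```python
import re

# Same shape as the pattern above, with /archives/ substituted for /magazine/
# (an assumption, to line up with the question). The lookahead rejects any
# string that contains /archives/; the rest requires it to end in .php.
EXCLUDE = re.compile(r"(?!.*/archives/)(?:^.*\.php$)")


def is_excluded(url: str) -> bool:
    """Hypothetical helper: True if the exclusion pattern matches the URL."""
    return EXCLUDE.search(url) is not None


print(is_excluded("www.mysite.com/archives/123qwe/index.php"))  # False
print(is_excluded("www.mysite.com/123qwe/index.php"))           # True
```

The first URL is not excluded, so it would still be indexed; the second is excluded.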

You can see it in action here on Regexr


Try this:

^www\.mysite\.com(?:/(?!archives/)[^/.]+)+\.php$

Or, more legibly:

^www\.mysite\.com
(?:
  /               # After consuming the `/`...
  (?!archives/)   # if the next name isn't `archives`...
  [^/.]+          # consume it. 
)+                # Repeat as needed.
\.php$

When you're creating a regex and you're not sure how to proceed, lookbehinds should never be the first tool you reach for. In fact, I tend to regard them as a last resort. They're just not useful enough to offset the complexity they introduce.
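
As a rough way to try the commented pattern above, the following Python sketch compiles it with re.VERBOSE so the layout and comments can stay in place. The helper name is_excluded is hypothetical, and the target crawler may or may not support a free-spacing mode, so the one-line form is probably the one to submit.

```python
import re

# The commented pattern from above, compiled in verbose (free-spacing) mode.
EXCLUDE = re.compile(
    r"""
    ^www\.mysite\.com
    (?:
      /               # after consuming the slash...
      (?!archives/)   # if the next name isn't archives...
      [^/.]+          # consume it
    )+                # repeat as needed
    \.php$
    """,
    re.VERBOSE,
)


def is_excluded(url: str) -> bool:
    """Hypothetical helper: True if the exclusion pattern matches the URL."""
    return EXCLUDE.search(url) is not None


print(is_excluded("www.mysite.com/123qwe/index.php"))           # True
print(is_excluded("www.mysite.com/archives/123qwe/index.php"))  # False
print(is_excluded("www.mysite.com/a/archives/b/index.php"))     # False
```

One thing to keep in mind: because each path segment is matched with [^/.]+, a directory name that contains a dot (say, /v1.2/) will prevent a match, so such URLs would not be excluded.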


Couldn't you just be greedy and specify that you want /archives/ in your regular expression?

^(\/archives\/.+?)\.php$
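
A small Python sketch of how this pattern behaves, with a hypothetical helper name archived_path. Note that it describes what to index rather than what to exclude, and because of the leading ^ it expects a path that begins with /archives/, without the host name in front.

```python
import re

# Inclusion pattern from above (the escaped slashes are optional in Python).
INCLUDE = re.compile(r"^(/archives/.+?)\.php$")


def archived_path(path: str):
    """Hypothetical helper: return the captured group, or None if no match."""
    match = INCLUDE.search(path)
    return match.group(1) if match else None


print(archived_path("/archives/123qwe/index.php"))                # /archives/123qwe/index
print(archived_path("/123qwe/index.php"))                         # None
print(archived_path("www.mysite.com/archives/123qwe/index.php"))  # None
```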