Regex to find some .php files
I'm trying to make an exclusion regex for a crawler. I want to index only the .php files that appear in the /archives/ directory, and none anywhere else. So the regex should match all .php files, except those that are in an /archives/ directory (however deeply nested). So, for example, it will index
www.mysite.com/archives/123qwe/index.php
but not
www.mysite.com/123qwe/index.php
I believe this regex should work: (?<!\/archives\/.*)\.php$
However, I'm not able to use the < character, because I need to submit the regex into a web form that sanitizes <'s from the input, so using < breaks the regex. Is there another way to form this regex, without needing the <?
What about
(?!.*\/archives\/)(?:^.*\.php$)
This uses a negative lookahead instead of your negative lookbehind. The regex matches if there is no /archives/ anywhere in the string and the string ends with .php. That's very similar to your approach, but without the <.
You can see it in action here on Regexr
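To check the lookahead pattern against the URLs from the question, here is a minimal Python sketch. It assumes the crawler's regex engine handles lookaheads the way Python's re module does, and that the directory to keep is /archives/ as in the question. The helper name is_excluded is made up for illustration.

```python
import re

# Exclusion pattern: matches .php URLs that do NOT contain /archives/.
# Assumption: the crawler applies the pattern to the whole URL string.
EXCLUDE = re.compile(r"(?!.*/archives/)(?:^.*\.php$)")


def is_excluded(url: str) -> bool:
    """Return True if the crawler should skip this URL (hypothetical helper)."""
    return EXCLUDE.search(url) is not None


if __name__ == "__main__":
    for url in (
        "www.mysite.com/archives/123qwe/index.php",
        "www.mysite.com/123qwe/index.php",
    ):
        print(url, "excluded" if is_excluded(url) else "indexed")
```

No < appears anywhere in the pattern, so it should survive the form's sanitizing.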
Try this:
^www\.mysite\.com(?:/(?!archives/)[^/.]+)+\.php$
Or, more legibly:
^www\.mysite\.com
(?:
/ # After consuming the `/`...
(?!archives/) # if the next name isn't `archives`...
[^/.]+ # consume it.
)+ # Repeat as needed.
\.php$
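The free-spacing form can be run as written in engines that have a verbose mode. A minimal sketch in Python using re.VERBOSE follows. The crawler's engine may behave differently, and the helper name matches_exclusion is hypothetical.

```python
import re

# Same pattern as above, in free-spacing (verbose) form.
# Assumption: URLs are given without a scheme, as in the question.
EXCLUDE_VERBOSE = re.compile(
    r"""
    ^www\.mysite\.com
    (?:
      /               # After consuming the `/`...
      (?!archives/)   # if the next name isn't `archives`...
      [^/.]+          # consume it.
    )+                # Repeat as needed.
    \.php$
    """,
    re.VERBOSE,
)


def matches_exclusion(url: str) -> bool:
    """Return True if the URL matches the exclusion pattern (hypothetical helper)."""
    return EXCLUDE_VERBOSE.search(url) is not None


if __name__ == "__main__":
    print(matches_exclusion("www.mysite.com/123qwe/index.php"))
    print(matches_exclusion("www.mysite.com/archives/123qwe/index.php"))
```

Note that [^/.]+ does not allow dots inside a path segment, so a URL such as www.mysite.com/my.dir/index.php is not matched by this pattern.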
When you're creating a regex and you're not sure how to proceed, lookbehinds should never be the first tool you reach for. In fact, I tend to regard them as a last resort. They're just not useful enough to offset the complexity they introduce.
Couldn't you just specify that you want /archives/ in your regular expression?
^(\/archives\/.+?)\.php$