A preg_replace puzzle: replacing zero or more of a char at the end of the subject
Say $d is a directory path and I want to ensure that it starts and ends with exactly one slash (/). It may initially have zero, one or more leading and/or trailing slashes.
I tried:
preg_replace('%^/*|/*$', '/', $d);
which works for the leading slash but to my surprise yields two trailing slashes if $d has at least one trailing slash. If the subject is, e.g., 'foo///'
then preg_replace() first matches and replaces the three trailing slashes with one slash and then it matches zero slashes at the end and replaces that with with a slash. (You can verify this by replacing the second argument with '[$0]'
.) I find this rather counterintuitive.
While there are many other ways to solve the underlying problem (and I implemented one) this became a PCRE puzzle for me: what (scalar) pattern in a single preg_replace
does this job?
ADDITIONAL QUESTION (edit)
Can anyone explain why this pattern matches the way it does a开发者_运维问答t the end of the string but does not behave similarly at the start?
$path = '/' . trim($path, '/') . '/';
This first removes all slashes at beginning or end and then adds single ones again.
Given a regex like /*
that can legitimately match zero characters, the regex engine has to make sure that it never matches more than once in the same spot, or it would get stuck in an infinite loop. Thus, if it does consume zero characters, the engine jumps forward one position before attempting another match. As far as I know, that's the only situation in which the regex engine does anything on its own initiative.
What you're seeing is the opposite situation: the regex consumes one or more characters, then on the next go-round it tries to start matching at the spot where it left off. Never mind that this particular regex can't match anything but the one character, and it already matched as many of those as it could; it still has the option of matching nothing, so that's what it does.
So, why doesn't your regex match twice at the beginning, like it does at the end? Because of the start anchor (^
). If the subject starts with one or more slashes, it consumes them and then tries to match zero slashes, but it fails because it's not at the beginning of the string any more. And if there are no slashes at the beginning, the manual bump-along has the same affect.
At the end of the subject it's a different story. If there are no slashes there, it matches nothing, tries to bump along and fails; end of story. But if it does match one or more slashes, it consumes them and tries to match again--and succeeds because the $
anchor still matches.
So in general, if you want to prevent this kind of double match, you can either add a condition to the beginning of the match to prevent it, like the ^
anchor does for the first alternative:
preg_replace('%^/*|(?<!/)/*$%', '/', $d);
...or make sure that part of the regex has to consume at least one character:
preg_replace('%^/*|([^/])/*$%', '$1/', $d);
But in this case you have a much simpler option, as demonstrated by John Kugelman: just capture the part you want to keep and chuck the rest.
preg_replace('%^/*(.*?)/*$%', '/\1/', $d)
it can be done in a single preg_replace
preg_replace('/^\/{2,}|\/{2,}$|^([^\/])|([^\/])$/', '\2/\1', $d);
A small change to your pattern would be to separate out the two key concerns at the end of the string:
- Replace multiple slashes with one slash
- Replace no slashes with one slash
A pattern for that (and the existing part for matching at the start of the string) would look like:
#^/*|/+$|$(?<!/)#
A slightly less concise, but more precise, option would be to be very explicit about only matching zero or two-or-more slashes; the notion being, why replace one slash with one slash?
#^(?!/)|^/{2,}|/{2,}$|$(?<!/)#
Aside: nikic's suggestion to use trim
(to remove leading/trailing slashes, then add your own) is a good one.
精彩评论