Extracting a 'word' matching certain criteria
I have the following string:
SEDCVBNT S800BG09 7GFHFGD6H 324235346 RHGF7U S8-00BG/09 7687678
and the following regex:
preg_match_all('/\b(?=.+[0-9])(?=.+[A-Z])[A-Z0-9-\/]{4,20}/i', $string, $matches)
What I'm trying to achieve is to return all the 'words' that:
- contain at least 1 number
- contain at least 1 letter
- may contain '/'
- may contain '-'
Unfortunately, the above regex returns:
Array ( [0] => Array ( [0] => SEDCVBNT [1] => S800BG09 [2] => 7GFHFGD6H [3] => 324235346 [4] => RHGF7U [5] => S8-00BG/09 ) )
I don't want 'SEDCVBNT' or '324235346' to be returned.
I've searched high and low and tried many small alterations to the above regex, but I'm totally stuck on this. I'd really appreciate any help.
Thanks in advance.
You need slightly advanced regex syntax for this one.
The regex I came up with is
(?<=\s|^)(?=[\w/-]*\d[\w/-]*)(?=[\w/-]*[A-Za-z][\w/-]*)([\w/-])+(?=\s|$)
Let's explain it:
- The syntax [\w/-] comes up a lot; this means "any word character (letters, digits and the underscore; whether accented letters count depends on the regex flavor and flags) or a slash or a dash". Effectively, these are all the characters that you consider to be part of a valid token.
- The regex uses positive lookahead to make sure that, at the place where a match is attempted, the following text satisfies certain criteria. Positive lookahead looks like this: (?=[\w/-]*\d[\w/-]*)
- It also uses a positive lookahead at the end, (?=\s|$), and a positive lookbehind at the beginning, (?<=\s|^), to make sure that a match is only made if the whole token begins after a whitespace character or at the beginning of the input string (\s|^) and is followed by a whitespace character or the end of the input string (\s|$).
- Since the two interior lookahead patterns are almost identical to the capture group pattern ([\w/-])+, in effect I'm using them to only match text that matches multiple patterns: both of the lookaheads and the capture group pattern at the end.
- The first lookahead ensures that the next token includes at least one digit (\d).
- The second lookahead ensures that the next token includes at least one letter ([A-Za-z]).
- The capture group matches one or more word characters, / and -.
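To make the "multiple patterns at the same position" idea concrete, here is a minimal sketch that tests each lookahead on its own and then both together. The anchored single-token patterns and the $tokens list are my own illustration (assumptions for this sketch), not part of the regex above.

```php
<?php
// Illustrative tokens (an assumption for this sketch), taken from the question's sample string.
$tokens = ['SEDCVBNT', 'S800BG09', '324235346', 'S8-00BG/09'];

$results = [];
foreach ($tokens as $token) {
    // First lookahead alone: is there a digit in the run of allowed characters?
    $hasDigit = preg_match('~^(?=[\w/-]*\d)~', $token) === 1;
    // Second lookahead alone: is there a letter in the run of allowed characters?
    $hasLetter = preg_match('~^(?=[\w/-]*[A-Za-z])~', $token) === 1;
    // Both lookaheads at the same position act like a logical AND;
    // only then does [\w/-]+ consume the token.
    $both = preg_match('~^(?=[\w/-]*\d)(?=[\w/-]*[A-Za-z])[\w/-]+$~', $token) === 1;
    $results[] = [$token, $hasDigit, $hasLetter, $both];
}

foreach ($results as [$token, $hasDigit, $hasLetter, $both]) {
    echo $token, ' => ', var_export($both, true), "\n";
}
```

Lookaheads consume no characters, so each one starts checking from the same position, and the token is accepted only if every check passes.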
Therefore, for the capture group to match, the text being examined must:
- Be preceded either by whitespace or the beginning of the input string (this prevents partial-word matches starting after a disallowed character)
- Include at least one digit in the next stretch of allowed characters (first positive lookahead)
- Include at least one letter in the next stretch of allowed characters (second positive lookahead)
- Consist only of word characters, / and - (capturing group).
- Be followed either by whitespace or the end of the input string (this prevents partial-word matches ending at a disallowed character).
Which is exactly what you require. :)
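For reference, this is roughly how the regex could be run from PHP against the string in the question. The ~ delimiter and the variable names are my own choices for this sketch; any delimiter that does not appear in the pattern should work.

```php
<?php
// Sample string from the question.
$string = 'SEDCVBNT S800BG09 7GFHFGD6H 324235346 RHGF7U S8-00BG/09 7687678';

// "~" is the delimiter here, so the "/" inside the character class needs no escaping.
$pattern = '~(?<=\s|^)(?=[\w/-]*\d[\w/-]*)(?=[\w/-]*[A-Za-z][\w/-]*)([\w/-])+(?=\s|$)~';

preg_match_all($pattern, $string, $matches);

// $matches[0] holds the full matches. $matches[1] only holds the last character
// of each match, because the "+" repeats the group instead of sitting inside it.
print_r($matches[0]); // S800BG09, 7GFHFGD6H, RHGF7U, S8-00BG/09
```

Use $matches[0] for the tokens. If you want them in group 1 instead, moving the quantifier inside the parentheses, ([\w/-]+), should do it.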
See it in action!
Note: refiddle.com seems to not play well with lookbehind, so the regexp after the link does not include the initial (?<=\s|^) part. This means that it will erroneously match the DEF456 in ABC123$DEF456.
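The effect of the lookbehind can be checked in PHP by running the pattern with and without it. This is only a sketch; the variable names are made up for the comparison.

```php
<?php
// Single quotes matter: in double quotes PHP would try to interpolate $DEF456.
$input = 'ABC123$DEF456';

// The regex without its leading lookbehind.
$core = '(?=[\w/-]*\d[\w/-]*)(?=[\w/-]*[A-Za-z][\w/-]*)([\w/-])+(?=\s|$)';

preg_match_all('~' . $core . '~', $input, $withoutLookbehind);
preg_match_all('~(?<=\s|^)' . $core . '~', $input, $withLookbehind);

print_r($withoutLookbehind[0]); // contains DEF456, a partial-word match
print_r($withLookbehind[0]);    // empty: DEF456 is preceded by "$", not by whitespace
```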
Here is the raw regex: \b(?=\S*?\d)(?=\S*?[a-z])\S+?(?=$|\s)
preg_match_all('/\b(?=\S*?\d)(?=\S*?[a-z])\S+?(?=$|\s)/i', $string, $matches)
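A quick way to try this version is sketched below, using the string from the question. The second input is a hypothetical extra case I added to show one difference as far as I can tell: \S accepts any non-whitespace character, so this pattern does not limit tokens to letters, digits, / and -.

```php
<?php
$pattern = '/\b(?=\S*?\d)(?=\S*?[a-z])\S+?(?=$|\s)/i';

// Sample string from the question.
$string = 'SEDCVBNT S800BG09 7GFHFGD6H 324235346 RHGF7U S8-00BG/09 7687678';
preg_match_all($pattern, $string, $matches);
print_r($matches[0]); // S800BG09, 7GFHFGD6H, RHGF7U, S8-00BG/09

// Hypothetical extra input (not from the question): the "$" inside the token
// is a non-whitespace character, so the whole token is matched.
preg_match_all($pattern, 'ABC123$DEF456', $loose);
print_r($loose[0]); // ABC123$DEF456
```

If tokens containing other punctuation should be rejected, replacing \S with a character class such as [\w\/-] would be one option to consider.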