How could I find all whitespaces excluding the ones between quotes?
I need to split a string by spaces, but phrases in quotes should be kept unsplit. Example:
word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5
This should result in the following array after preg_split:
array(
[0] => 'word1',
[1] => 'word2',
[2] => 'this is a phrase',
[3] => 'word3',
[4] => 'word4',
[5] => 'this is a second phrase',
[6] => 'word5'
)
How should I compose my regexp to do that?
PS. There is a related question, but I don't think it works in my case. The accepted answer provides a regexp to find words instead of whitespace.
With the help of user MizardX from the #regex IRC channel (irc.freenode.net), a solution was found. It even supports single quotes.
$str= 'word1 word2 \'this is a phrase\' word3 word4 "this is a second phrase" word5 word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
$regexp = '/\G(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)*\K\s+/';
$arr = preg_split($regexp, $str);
print_r($arr);
Result is:
Array (
[0] => word1
[1] => word2
[2] => 'this is a phrase'
[3] => word3
[4] => word4
[5] => "this is a second phrase"
[6] => word5
[7] => word1
[8] => word2
[9] => "this is a phrase"
[10] => word3
[11] => word4
[12] => "this is a second phrase"
[13] => word5
)
PS. The only disadvantage is that this regexp works only with PCRE 7.
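Note that the result above keeps the surrounding quotes, while the question asked for bare phrases. Below is a minimal sketch of one way to strip them after splitting. The function name split_outside_quotes is hypothetical (it does not appear in the original answer), and the sketch assumes well-formed, paired quotes and a PCRE version that supports \K.

```php
<?php
// Hypothetical helper, not part of the original answer: split on whitespace
// that lies outside quotes, then drop one pair of surrounding quotes per token.
function split_outside_quotes(string $str): array
{
    // Same pattern as in the answer above; it relies on \G and \K.
    $regexp = '/\G(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)*\K\s+/';
    $tokens = preg_split($regexp, $str);

    // If a token starts and ends with the same quote character, keep only
    // what is between the two quotes; other tokens are left unchanged.
    return preg_replace('/^(["\'])(.*)\1$/s', '$2', $tokens);
}

print_r(split_outside_quotes('word1 word2 "this is a phrase" word3'));
```

Tokens that are not fully wrapped in a matching pair of quotes pass through unchanged, so plain words are not affected.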
It turned out that my production server does not have PCRE 7 support; only PCRE 6 is installed there. A regexp that works there, although it is less flexible than the PCRE 7 one, is the following (it gets rid of \G and \K):
/(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)+/
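For the given input the result is the same as above, provided the pattern is used with preg_match_all() (the tokens end up in $matches[0]) instead of preg_split(), because it matches the tokens themselves rather than the whitespace between them.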
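As a rough sketch of how the PCRE 6 pattern can be applied: since it matches tokens rather than separators, it is used here with preg_match_all() and the full matches are read from index 0. The wrapper name match_tokens is hypothetical and only there for illustration.

```php
<?php
// Hypothetical wrapper: collect tokens with the pattern that avoids \G and \K.
function match_tokens(string $str): array
{
    // Each match is a run of quoted strings and/or non-space, non-quote characters.
    $regexp = '/(?:"[^"]*"|\'[^\']*\'|[^"\'\s]+)+/';
    preg_match_all($regexp, $str, $matches);

    // Index 0 holds the full text of every match, in order of appearance.
    return $matches[0];
}

print_r(match_tokens('word1 \'this is a phrase\' word3 "this is a second phrase" word5'));
```

As with the preg_split() version, the quotes remain part of the returned tokens.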
Assuming your quotes are well formed, i.e. they come in pairs, you can explode on the quote character and loop over every second field, e.g.:
$str = "word1 word2 \"this is a phrase\" word3 word4 \"this is a second phrase\" word5 word6 \"lastword\"";
print $str ."\n";
$s = explode('"',$str);
for ($i = 1; $i < count($s); $i += 2) {
    if (strpos($s[$i], " ") !== FALSE) {
        print "Spaces found: $s[$i]\n";
    }
}
output
$ php test.php
Spaces found: this is a phrase
Spaces found: this is a second phrase
No complicated regexp required.
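The loop above only reports the quoted phrases. A possible extension that produces the full token list could look like the sketch below. It assumes double quotes only and well-formed pairs; the function name split_by_explode is hypothetical. Because explode() keeps empty segments, the odd/even positions stay aligned even when the string starts with a quote.

```php
<?php
// Hypothetical helper: segments at even positions lie outside quotes,
// segments at odd positions are the quoted phrases.
function split_by_explode(string $str): array
{
    $result = array();
    foreach (explode('"', $str) as $i => $segment) {
        if ($i % 2 === 0) {
            // Outside quotes: split on runs of whitespace, skipping empty pieces.
            $words = preg_split('/\s+/', $segment, -1, PREG_SPLIT_NO_EMPTY);
            $result = array_merge($result, $words);
        } else {
            // Inside quotes: keep the phrase as a single element.
            $result[] = $segment;
        }
    }
    return $result;
}

print_r(split_by_explode('word1 word2 "this is a phrase" word3'));
```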
Using the regex from the other question you linked, this is rather easy:
<?php
$string = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
preg_match_all( '/(\w+|"[\w\s]*")+/' , $string , $matches );
print_r( $matches[1] );
?>
output:
Array
(
[0] => word1
[1] => word2
[2] => "this is a phrase"
[3] => word3
[4] => word4
[5] => "this is a second phrase"
[6] => word5
)
Anybody want to benchmark tokenizing vs. regex? My guess is that the explode() function is a little too hefty to give any speed benefit. Nonetheless, here's another method:
(edited because I forgot the else case for storing the quoted string)
$str = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
// initialize storage array
$arr = array();
// initialize count
$count = 0;
// split on quote
$tok = strtok($str, '"');
while ($tok !== false) {
    // even-numbered tokens are outside quotes and get split on spaces;
    // odd-numbered tokens are quoted phrases and are stored whole
    $arr = ($count % 2 == 0) ?
        array_merge($arr, explode(' ', trim($tok))) :
        array_merge($arr, array(trim($tok)));
    $tok = strtok('"');
    ++$count;
}
// output results
var_dump($arr);
$test = 'word1 word2 "this is a phrase" word3 word4 "this is a second phrase" word5';
preg_match_all( '/([^"\s]+)|("([^"]+)")/', $test, $matches);
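The snippet above leaves the tokens spread over several capture groups: group 1 holds the bare words and group 3 the text inside the quotes. A minimal sketch of merging them back into one ordered list follows; the function name merge_groups is hypothetical, and the sketch relies on the default PREG_PATTERN_ORDER layout, in which a group that did not take part in a match is reported as an empty string.

```php
<?php
// Hypothetical helper: run the pattern and pick, for every match,
// whichever capture group actually matched.
function merge_groups(string $str): array
{
    preg_match_all('/([^"\s]+)|("([^"]+)")/', $str, $matches);

    $tokens = array();
    foreach ($matches[1] as $i => $word) {
        // Group 1 is empty when the match was a quoted phrase;
        // in that case group 3 holds the phrase without its quotes.
        $tokens[] = ($word !== '') ? $word : $matches[3][$i];
    }
    return $tokens;
}

print_r(merge_groups('word1 word2 "this is a phrase" word3'));
```

Unlike the earlier answers, this returns the quoted phrases without their quotes, which is the form the question asked for.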