Help Parsing a string in PHP
I will have a string like this:
Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?
I want to parse it into an array, using the following delimiters to identify each array element:
.
!
?
<b> and </b>
So I will have an array with the following structure:
[0]Bob is a boy.
[1]Bob is 1000 years old!
[2]Bob loves you!
[3]Do you love bob?
Any ideas?
As you can see, I'd like the text between <b> and </b> to be extracted as its own element. Previously I was using the following regexp to do that:
preg_match_all(":<b>(.*?)</b>:is", $text, $matches);
I think this should accomplish what you're going for:
$string = 'Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?';
// parser
$array = preg_split('/[.!?]|\s*<b>|<\/b>\s*/', $string, 0, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
foreach ($array as $key => $element) $array[$key] = trim($element[0]).substr($string,$element[1]+strlen($element[0]),1);
print_r($array);
It yields:
Array
(
[0] => Bob is a boy.
[1] => Bob is 1000 years old!
[2] => Bob loves you!
[3] => Do you love bob?
)
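The preg_split() call grabs each run of text between the delimiters, together with its offset in the string (that is what PREG_SPLIT_OFFSET_CAPTURE is for). The foreach loop then trims each piece and appends the character that followed it in the original string, which puts the punctuation mark back on the end of each element.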
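To make the two steps easier to follow, here is a small sketch that keeps the intermediate result around instead of overwriting it. The variable names $pieces and $rebuilt are mine, not part of the answer, and the character class is written as [.!?] on the assumption that a literal | is not meant to be a delimiter.

```php
<?php
$string = 'Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?';

// Step 1: split on the delimiters, keeping each piece's byte offset.
// Every entry of $pieces is a two-element array: [text, offset].
$pieces = preg_split('/[.!?]|\s*<b>|<\/b>\s*/', $string, -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);

// Step 2: the character right after each piece is the delimiter that ended it.
$rebuilt = [];
foreach ($pieces as [$text, $offset]) {
    $mark = substr($string, $offset + strlen($text), 1);
    $rebuilt[] = trim($text) . $mark;
}

print_r($pieces);   // offsets here are 0, 13, 40 and 59
print_r($rebuilt);  // the four sentences, punctuation restored
```

Note that the second piece still carries its leading space (' Bob is 1000 years old' at offset 13), which is why the offset arithmetic has to use the untrimmed text and trim() is applied only when building the final element.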
If nobody provides a better solution, this almost works:
(?:<b>|[.!?]*)((?:[^<]+?)(?:[.!?]+|</b>))\s+
The only catch is that it returns Bob loves you!</b> as the third match, which can be cleaned up by applying strip_tags() to the results, I guess...
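For what it's worth, here is a rough sketch of how that pattern could be used with preg_match_all(). One thing is changed on my own assumption: the trailing \s+ is replaced with (?:\s+|$), because as written the pattern needs whitespace after every sentence and so appears to skip the last one when the string ends right after the ?. The ~ delimiter is used only so that </b> does not need escaping.

```php
<?php
$text = 'Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?';

// Pattern from the answer above, with (?:\s+|$) in place of \s+ (my assumption)
// so that a sentence at the very end of the string is matched too.
$pattern = '~(?:<b>|[.!?]*)((?:[^<]+?)(?:[.!?]+|</b>))(?:\s+|$)~';

preg_match_all($pattern, $text, $matches);

// Group 1 holds the sentences; the bold one still ends in "</b>",
// so strip_tags() is applied to every captured string.
$sentences = array_map('strip_tags', $matches[1]);

print_r($sentences);
```

This still assumes the text never contains a < outside of the <b> tags, since [^<] stops at any opening angle bracket.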
Divide and conquer?
Assume $myString is your string...
First grab your quoted stuff:
preg_match('/^(.*?)<b>(.*?)<\/b>(.*)$/s', $myString, $m);
Now you have $m[1], $m[2], and $m[3]:
$firstMatches = preg_split('/[.!?]/', $m[1]);
$lastMatches = preg_split('/[.!?]/', $m[3]);
Then get your punctuation back:
function addPunctuation($matches, $myString)
{
    $punctuatedResults = array();
    foreach ($matches as $match)
    {
        // Drop surrounding whitespace and skip the empty pieces the split leaves behind.
        $match = trim($match);
        if ($match === '') continue;
        // $position is the offset of the start of your match. Find the character after your match.
        $position = strpos($myString, $match);
        $punctMark = substr($myString, $position + strlen($match), 1);
        $punctuatedResults[] = $match . $punctMark;
    }
    return $punctuatedResults;
}
$allMatches = addPunctuation($firstMatches, $myString);
$allMatches[] = $m[2];
$allMatches = array_merge($allMatches, addPunctuation($lastMatches, $myString));
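The steps above handle a single <b>...</b> pair. If the string may contain several, the same divide-and-conquer idea could be sketched as follows. This is only an illustration: splitDivideAndConquer() is a hypothetical helper name, and it assumes every sentence outside the bold blocks ends in ., ! or ? (trailing text without a closing mark is dropped).

```php
<?php
// Hypothetical helper: divide on <b>...</b> blocks first,
// then conquer each plain-text chunk sentence by sentence.
function splitDivideAndConquer(string $text): array
{
    // Capturing the delimiter keeps the bold blocks in the result.
    $chunks = preg_split('~(<b>.*?</b>)~s', $text, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $sentences = [];
    foreach ($chunks as $chunk) {
        if (strpos($chunk, '<b>') === 0) {
            // A bold block becomes one element, minus its tags.
            $sentences[] = trim(strip_tags($chunk));
            continue;
        }
        // Plain text: each run of non-punctuation plus its closing mark(s).
        preg_match_all('/[^.!?]+[.!?]+/', $chunk, $m);
        foreach ($m[0] as $sentence) {
            $sentences[] = trim($sentence);
        }
    }
    return $sentences;
}

$result = splitDivideAndConquer(
    'Bob is a boy. Bob is 1000 years old! <b>Bob loves you!</b> Do you love bob?'
);
print_r($result);
```

Because the punctuation is matched together with its sentence here, there is no need to look it up again with strpos() afterwards.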