Fastest PHP Routine To Match Words
What is the fastest way in PHP to take a keyword list and match it to a search result (like an array of titles) for all words?
For instance, if my keyword phrase is "great leather shoes", then the following titles would be a match...
- Get Some Really Great Leather Shoes
- Leather Shoes Are Great
- Great Day! Those Are Some Cool Leather Shoes!
- Shoes, Made of Leather, Can Be Great
...while these would not be a match:
- Leather Shoes on Sale Today!
- You'll Love These Leather Shoes Greatly
- Great Shoes Don't Come Cheap
I imagine there's some trick with array functions or a RegEx (Regular Expression) to achieve this rapidly.
I would use an index for the words in the titles and test if every search term is in that index:
$terms = explode(' ', 'great leather shoes');
$titles = array(
'Get Some Really Great Leather Shoes',
'Leather Shoes Are Great',
'Great Day! Those Are Some Cool Leather Shoes!',
'Shoes, Made of Leather, Can Be Great'
);
foreach ($titles as $title) {
// extract words in lowercase and use them as key for the word index
$wordIndex = array_flip(preg_split('/\P{L}+/u', mb_strtolower($title), -1, PREG_SPLIT_NO_EMPTY));
// look up if every search term is in the index
foreach ($terms as $term) {
if (!isset($wordIndex[$term])) {
// if one is missing, continue with the outer foreach
continue 2;
}
}
// echo matched title
echo "match: $title\n";
}
You can preg_grep() your array against something like
/^(?=.*?\bgreat\b)(?=.*?\bleather\b)(?=.*?\bshoes\b)/i
or (probably faster) grep each word separately and then array_intersect the results.
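The "grep each word separately" idea could look roughly like the sketch below. The sample titles and the $matched variable are only illustrative, and array_intersect_key() is used instead of array_intersect() on the assumption that comparing the keys preserved by preg_grep() is cheaper and safer than comparing title strings.

```php
<?php
// Sample data borrowed from the question (illustrative only).
$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes Are Great',
    'Leather Shoes on Sale Today!',
    "You'll Love These Leather Shoes Greatly",
);
$keywords = array('great', 'leather', 'shoes');

// Start with every title and narrow the set one keyword at a time.
$matched = $titles;
foreach ($keywords as $keyword) {
    // preg_quote() escapes any regex metacharacters in the keyword.
    $pattern = '/\b' . preg_quote($keyword, '/') . '\b/i';
    // preg_grep() preserves the original keys, so intersect on keys.
    $matched = array_intersect_key($matched, preg_grep($pattern, $titles));
}

print_r(array_values($matched));
```

Whether this beats a single lookahead pattern probably depends on the number of titles and keywords, so it is worth measuring on real data.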
It might be a pretty naive solution (quite possibly there are more efficient/elegant solutions), but I'd probably do something like the following:
$keywords = array(
'great',
'leather',
'shoes'
);
$titles = array(
'Get Some Really Great Leather Shoes',
'Leather Shoes Are Great',
'Great Day! Those Are Some Cool Leather Shoes!',
'Shoes, Made of Leather, Can Be Great',
'Leather Shoes on Sale Today!',
'You\'ll Love These Leather Shoes Greatly',
'Great Shoes Don\'t Come Cheap'
);
$matches = array();
foreach( $titles as $title )
{
$wordsInTitle = preg_split( '~\b(\W+\b)?~', $title, -1, PREG_SPLIT_NO_EMPTY );
if( array_uintersect( $keywords, $wordsInTitle, 'strcasecmp' ) == $keywords )
{
// we have a match
$matches[] = $title;
}
}
var_dump( $matches );
No idea how this benchmarks though.
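One way to find out is a small timing harness like the sketch below. timeIt() is a hypothetical helper, not a PHP built-in, and hrtime() assumes PHP 7.3 or later. The pattern being timed is just an example; any of the approaches in this thread can be wrapped in the closure instead.

```php
<?php
// timeIt() is a hypothetical helper: runs $fn repeatedly and returns total milliseconds.
function timeIt(callable $fn, int $iterations = 1000): float
{
    $start = hrtime(true); // monotonic clock in nanoseconds (PHP 7.3+)
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e6;
}

$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes on Sale Today!',
    "Great Shoes Don't Come Cheap",
);
$regex = '/^(?=.*?\bgreat\b)(?=.*?\bleather\b)(?=.*?\bshoes\b)/i';

// Wrap the candidate approach in a closure and time it.
$ms = timeIt(function () use ($titles, $regex) {
    preg_grep($regex, $titles);
});

printf("lookahead regex: %.3f ms for 1000 runs\n", $ms);
```

Timings on three short titles say little; a realistic comparison would use a title list of the size you expect in production.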
You could use
/^(?=.*?\bgreat\b)(?=.*?\bshoes\b)(?=.*?\bleather\b)/i
Note a couple of things:
a) You need word boundaries at both ends; otherwise you could end up matching words that contain the ones you are looking for, e.g. "shoes of leather bring greatness".
b) I use a lazy wildcard match (i.e. .*?). This can improve efficiency, because by default * is greedy (i.e. it consumes as many characters as it can, and only gives them back in favor of an overall match). So without the trailing ?, .* will match everything in the line and then backtrack to find 'great'. The same procedure is then repeated for 'shoes' and 'leather'. By making * lazy, we avoid these unnecessary backtracks when the keywords occur early in the string.
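To illustrate points (a) and (b), here is a minimal sketch that builds such a pattern from an arbitrary keyword list. buildAllWordsPattern() is a hypothetical helper name, and the leading ^ plus the i and s modifiers are my assumptions about what a typical title search would want.

```php
<?php
// buildAllWordsPattern() is a hypothetical helper, not a PHP built-in.
function buildAllWordsPattern(array $keywords): string
{
    $pattern = '/^';
    foreach ($keywords as $keyword) {
        // One lazy lookahead per keyword, with a word boundary on each side.
        $pattern .= '(?=.*?\b' . preg_quote($keyword, '/') . '\b)';
    }
    // i = case-insensitive, s = let . also match newlines.
    return $pattern . '/is';
}

$titles = array(
    'Shoes, Made of Leather, Can Be Great',
    'Shoes of leather bring greatness',
    "Great Shoes Don't Come Cheap",
);

$pattern = buildAllWordsPattern(array('great', 'leather', 'shoes'));
$matches = preg_grep($pattern, $titles);
print_r($matches);
```

The trailing \b is what keeps "greatness" from counting as "great"; dropping it would let the second title through.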
I don't know about the absolute fastest way, but this is probably the fastest way to do it with a regex:
'#(?:\b(?>great\b()|leather\b()|shoes\b()|\w++\b)\W*+)++\1\2\3#i'
This matches every word in the string, and if the word happens to be one of your keywords, the empty capturing group "checks it off". Once all the words in the string have been matched, the back-references (\1\2\3) ensure that each of the three keywords has been seen at least once.
The lookahead-based approach that's usually recommended for this kind of task needs to scan potentially the whole string multiple times, once for each keyword. This regex only has to scan the string once; in fact, backtracking is disabled by the possessive quantifiers (++, *+) and atomic groups ((?>...)).
That said, I would still go with the lookahead approach unless I knew it was causing a bottleneck. In most cases, its greater readability is worth the trade-off in performance.
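For anyone who wants to try the "check it off" pattern, a minimal sketch against the titles from the question might look like this. It assumes PHP's PCRE engine keeps a group's capture from an earlier iteration of the repeated group, which is the behaviour the pattern relies on.

```php
<?php
// The pattern from the answer above, unchanged.
$pattern = '#(?:\b(?>great\b()|leather\b()|shoes\b()|\w++\b)\W*+)++\1\2\3#i';

// Titles from the question: the first four should match, the last three should not.
$titles = array(
    'Get Some Really Great Leather Shoes',
    'Leather Shoes Are Great',
    'Great Day! Those Are Some Cool Leather Shoes!',
    'Shoes, Made of Leather, Can Be Great',
    'Leather Shoes on Sale Today!',
    "You'll Love These Leather Shoes Greatly",
    "Great Shoes Don't Come Cheap",
);

// preg_grep() returns only the titles the pattern matches, keeping their keys.
$matched = preg_grep($pattern, $titles);
print_r($matched);
```

Each empty group only participates in the match when its keyword has been consumed, so a back-reference to a keyword that never appeared fails and the title is rejected.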
I can't offer you a definitive answer, but I'd try benchmarking each solution that's suggested, and I would start with chaining some in_array() calls together.
// $list is assumed to hold the lowercased words of a single title
if (in_array('great', $list) && in_array('leather', $list) && in_array('shoes', $list)) {
// Do something
}
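Since in_array() works on an array of words rather than on the title string, the title has to be split first. A rough sketch, assuming ASCII titles (for other alphabets, mb_strtolower() and a Unicode-aware split as in the first answer would be the safer choice); titleHasAllWords() is a hypothetical helper name:

```php
<?php
// titleHasAllWords() is a hypothetical helper used only for illustration.
function titleHasAllWords(string $title, array $keywords): bool
{
    // Lowercase, then split on runs of non-word characters (ASCII assumption).
    $list = preg_split('/\W+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($keywords as $keyword) {
        if (!in_array($keyword, $list, true)) {
            return false; // stop at the first missing keyword
        }
    }
    return true;
}

$keywords = array('great', 'leather', 'shoes');
var_dump(titleHasAllWords('Leather Shoes Are Great', $keywords));
var_dump(titleHasAllWords('Leather Shoes on Sale Today!', $keywords));
```

in_array() is a linear scan, so for long titles or many keywords the array_flip()/isset() index from the first answer is likely to scale better.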