开发者

Fastest way to find relevant results in array from an input array

As a mostly front-end developer, this is in the realm of computer science that I don't often delve into, but here's my scenario:

I've got an input of a string, split on spaces, say "pinto beans"

I've got a array of results to search, that contains results like: ["beans, mung","beans, pinto","beans, yellow","beans, fava"]

what might be the quickest way (preferably in javascript or php) to find the most "relevant" results, aka most matches, for instance, in the above case, I would like to sort the return array so that "beans, pinto" is put at the top, and the rest come below, and any other results would go below those.

My first attempt at this would be to do something like matching each result item against each input item, and incrementing matches on each one, then sorting by most matches to least.

This approach would require me to iterate through the entire result array a ton of times though, and I feel that my lack of CS knowledge is leaving me without the best solution here.

/* EDIT: Here's how I ended up dealing with the problem: */

Based on crazedfred's suggestion and the blog post he mentioned (which was VERY helpful), I wrote some php that basically uses a combination of the trie method and the boyer-moore method, except searching from the beginning of the string (as I don't want to match "bean" in "superbean").

I chose php for the ranking based on the fact that I'm using js libraries, and getting real benchmarks while using convenience functions and library overhead wouldn't produce the testable results I'm after, and I can't guarantee that it won't explode in one browser or another.

Here's the test data:

Search String: lima beans

Result array (from db): ["Beans, kidney","Beans, lima","Beans, navy","Beans, pinto","Beans, shellie","Beans, snap","Beans, mung","Beans, fava","Beans, adzuki","Beans, baked","Beans, black","Beans, black turtle soup","Beans, cranberry (roman)","Beans, french","Beans, great northern","Beans, pink","Beans, small white","Beans, yellow","Beans, white","Beans, chili","Beans, liquid from stewed kidney beans","Stew, pinto bean and hominy"]

First, I drop both the search string and the result array into php variables, after explode()ing the string on spaces.

then, I precompile my patterns to compare the results to:

$max = max(array_map('strlen',$input));
$reg = array();
for($m = 0; $m < $max; $m++) {
    $reg[$m] = "";
    for($ia = 0; $ia < count($input); $ia++) {
        $reg[$m]. = $input[$ia][$m];
    }
}

this gives me something like : ["lb","ie","ma","an","s"]

then, I basically take each result string (split on spaces), and match a case insensitive character class with the corresponding character number to it. If at any point during that comparison process I don't get any matches, I skip the word. This means if only 1 result starts with "b" or "l", I'll only run one comparison per WORD, which is really fast. Basically I'm taking the part of trie that compiles the searches together, and the constant speedup of the Boyer-Moore stuff.

Here's the php - I tried whiles, but got SIGNIFICANTLY better results wit开发者_如何学运维h foreaches:

$sort = array();
foreach($results as $result) {
    $matches = 0;
    $resultStrs = explode(' ', $result);
    foreach($resultStrs as $r) {
        $strlen = strlen($r);
        for($p = 0; $p < $strlen; $p++) {
            if($reg[$p])
                preg_match('/^['.$reg[$p].']/i',$r[$p],$match);
            if($match==true) {
                $matches++;
            } else {
                break 2;
            }
        }
    }
    $sort[$result] = $matches;
}

That outputs an array with the results on the keys, and how many character matches we got in total on the values.

The reason I put it that way is to avoid key collisions that would ruin my data, and more importantly, so I can do a quick asort and get my results in order.

That order is in reverse, and on the keys, so after the above code block, I run:

asort($sort);
$sort = array_reverse(array_keys($sort));

That gives me a properly indexed array of results, sorted most to least relevant. I can now just drop that in my autocomplete box.

Because speed is the whole point of this experiment, here's my results - obviously, they depend partially on my computer.

2 input words, 40 results: ~5ms 2 input words, (one single character, one whole) 126 results: ~9ms

Obviously there's too many variables at stake for these results to mean much to YOU, but as an example, I think it's pretty impressive.

If anyone sees something wrong with the above example, or can think of a better way than that, I'd love to hear about it. The only thing I can think of maybe causing problems right now, is if I were to search for the term lean bimas, I would get the same result score as lima beans, because the pattern isn't conditional based on the previous matches. Because the results I'm looking for and the input strings I'm expecting shouldn't make this happen very often, I've decided to leave it how it is for now, to avoid adding any overhead to this quick little script. However, if I end up feeling like my results are being skewed by it, I'll come back here and post about how I sorted that part.


I try to suggest a solution for JavaScript but it can works in PHP too. In this way you don't need to use nested loops, you can use just a sorting function and some regular expression.

Something like this:

var query = 'pinto beans';
var results = [ 'beans, mung','beans, pinto','beans, yellow','beans, fava' ];

// Evaluate a regular expression for your 
// query like /pinto|beans/g joining each
// query item with the alternation operator
var pattern = eval( '/' + query.split( ' ' ).join( '|' ) + '/g' );

// Define a function for custom sorting
function regsort( a, b ) {
    var ra = a.match( pattern );
    var rb = b.match( pattern );
    // return the indexing based on 
    // any single query item matched
    return ( rb ? rb.length : 0 ) - ( ra ? ra.length : 0 );
}

// Just launch the sorting
var sorted = results.sort( regsort );

// Should output the right sort:
// ["beans, pinto", "beans, mung", "beans, yellow", "beans, fava"]
console.log( sorted );

I'm not sure if this is the fastest way to handle your request but for sure it could be a nice solution to avoid a nested loop + string comparsion.

Hope this helps! Ciao


Since you specifically noted that it could be in several languages, I'll leave my answer in pseudocode so you can adapt to the language of your choice.

Since you are matching an array-to-array, performance is going to vary a lot based on your implementation, so trying several ways and considering exactly when/how/how-often this will be used in a good idea.

The simple way is to leave the data as-is and run an O(n^2) search:

for (every element in array X)
    for (every element in array Y)
        if (current X == current Y)
            add current X to results

return results

If you sorted the arrays first (a sorting algorithm such a squicksort is implemented for you in many languages, check your documentation!), then the actual matching is quicker. Use whatever string-comparison your language has:

Sort array X
Sort array Y

Let A = first element of X
Let B = first element of Y

while (A and B are in array)
    if (A > B)
        Next B
    else if (A < B)
        Next A
    else  //match!
        Add A to results
        Next A
        Next B

//handle case where one array is larger (trivial loop)

return results

Now the important part to the above solution is if the sorting of the arrays saved time versus just an ordinary O(n^2) sort. Usually, moving elements around in arrays is fast whereas string comparisons are not, so it may be worth it. Again, try both.

Finally, there's this crazy algorithm the Mailinator guy dreamed up for doing huge string comparisons in constant time using some awesome data structures. Never tried it myself, but it must work since he runs his whole site on very low-end hardware. He blogged about it here if you're interested. (Note: the blog post is about filtering spam, so some words in the post may be slightly NSFW.)


VERY PRIMITIVE

Just wanted to get that tid-bit out of the way up-front. But here is a simple implementation of sorting based on terms used. I will be up-front and mention that word variations (e.g. searching for "seasoned" and the result comes back as "seasonal") will have no effect on sorting (in fact, these words will, for all intents and purposes, but thought of as being just as different as "apple" and "orange" due to their suffix).

With that out of the way, here's is one method to do what you're looking for. I'm using this more as just an introduction as to how you can implement, it's up to you how you want to implement the "scoreWord" function. I also use jQuery's .each() method just because I'm lazy, but this can be replaced very easily with a for statement.

Anyways, here's one version, I hope it helps.

var search = "seasoned pinto beans";
var results = ["beans, mung","beans, pinto","beans, yellow","beans, pinto, seasonal","beans, fava"];

// scoreWord
// returns a numeric value representing how much of a match this is to the search term.
// the higher the number, the more definite a match.
function scoreWord(result){
    var terms = search.toLowerCase().split(/\W/),
        words = result.toLowerCase().split(/\W/),
        score = 0;
    // go through each word found in the result
    $.each(words,function(w,word){
        // and then through each word found in the search term
        $.each(terms,function(t,term){
            // exact word matches score higher (a whole point)
            if (term==word)
                score++;
            // a word found in a word should be considered a partial
            // match and carry less weight (1/2 point in this case)
            else if (term.indexOf(word)!=-1||word.indexOf(term)!=-1)
                score+=0.5;
        });
    });
    // return the score
    return score;
}
// go through and sort the array.
results.sort(function(a,b){
    // grab each result's "score", them compare them to see who rates higher in
    // the array
    var aScore = scoreWord(a), bScore = scoreWord(b);
    return (aScore>bScore?-1:(aScore<bScore?1:0));
});
// they are officially "sorted" by relevance at this point.


You can precompute a map which maps a word to many results.

var results = ["beans, mung","beans, pinto","beans, yellow","beans, fava"];
var index = {};
for (var i = 0; i < results.length; i ++) {
    results[i].replace(/\w+/g, function(a) {
        if (!index[a]) {
            index[a] = [i];
        } else {
            index[a].push (i);
        }
    });
}

When searching, you can split the query into words.

function doSearch(searchString) {
    var words = [];
    var searchResults = [];
    var currentIndex;
    searchString.replace(/\w+/g, function(a) {
        words.push (a);
    });

Build the search result as a copy of the results array, but I put it in object so that it can hold both the text and the score.

    for (var i = 0; i < results.length; i ++) {
        searchResults.push ({
            text: results[i],
            score: 0
        });
    }

Then, for each search word, increase the score in the search results.

    for (var i = 0; i < words.length; i ++) {
        if ((currentIndex = index[words[i]])) {
            for (var j = 0; j < currentIndex.length; j ++) {
                searchResults[currentIndex[j]].score ++;
            }
        }
    }

Finally, sort it by score.

    searchResults.sort (function(a, b) {
        return b.score - a.score;
    });
    return searchResults;
}

Doing `doSearch("pinto beans"), it returns an array of search results and also the score.

[{text:"beans, pinto", score:2}, {text:"beans, mung", score:1}, {text:"beans, yellow", score:1}, {text:"beans, fava", score:1}]
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜