开发者

Regular Expression (Javascript) - Take a scrambled word and find an unscrambled match

I have a list of all the words in the English dictionary (270,000+ words) stored in a variable called theList. I have a scrambled word word that I want to unscramble by matching against the word list. Initially, I thought that the following code would do the trick, but it doesn't work so well.

var theList; // Contains all the words in the English dictionary.

var word = "iexospensr"; // The word I want to unscramble.

var matches = word.match(new RegExp("^["+word+"]{"+word.length+"}$", "gim"));

I'd have expected "EXPRESSION" as the unscrambled result, but I instead get a lot more results (listed below).

EERINESSES,EXPRESSERS,EXPRESSION,IRONNESSES,ISOSPORIES,NONPERSONS,NONPROSSES,NOSINESSES,OPENNESSES,OPPRESSION,OPPRESSORS,ORNERINESS,PENSIEROSO,PENSIONEER,PENSIONERS,PEPPERONIS,PERSIENNES,PERSONISES,PIPINESSES,PIXINESSES,POORNESSES,PORINESSES,POSSESSION,POSSESSORS,PREEXPOSES,PREPOSSESS,PREP开发者_如何学编程PINESS,PRESENSION,PRIORESSES,PRISSINESS,PROPENSION,PROPERNESS,REINSPIRES,REPRESSERS,REPRESSION,REPRESSORS,RESERPINES,RESPONSERS,RESPONSORS,RIPENESSES,ROPINESSES,ROSINESSES,SERENENESS,SEXINESSES,SIXPENNIES,SNIPPINESS,SORENESSES,SPINNERIES

Perhaps, if I could find a way to tell the regular expression to consider each letter in the string word just once irrespective of the order of the letters. So the end result would be an array of combination of those letters, and not permutations (what I have now).

Any help would be appreciated.


EDIT: I think the way to go would be: 1. finding all the combinations of the scrambled word 2. matching them against the word list to check for validity

If you have a better solution (performance-wise), it'd help.


The best solution to this problem seems to be reordering the anagram by alphabet, and the entire word list and matching the word against each item in the list.

Here's the code:

    var textList; // the entire dictionary
    var list = textList.match(/^.*$/gim);
    var sortedList = [];
    list.forEach(function(element, index, array) {
        sortedList[index] = element.split("").sort().join("");
    });

    function unscramble(word)
    {
        word = word.toUpperCase().split("").sort().join("");
        var matches = [];
        for (var i = 0; i < list.length; i++) {
            if (word.indexOf(sortedList[i]) >= 0) {
                if (!matches[list[i].length])
                    matches[list[i].length] = [];
                matches[list[i].length].push(list[i]);
            }
        }
        return matches;
    }


I think a better approach would not use regular expressions. Instead it would test each member of the list against your scrambled word, by walking the characters of the word, and looking if that character exists in the word in the list. Each time it finds a character, it can mark that character as "already used".

Here's something to mark a character position as "used":

function checkUsed(o, which) {
if (o[which] != null) {
  o[which] = 1;
  return false;
  }
return true;
}


var usedMap = [];

if (checkUsed(usedMap, 5) == false) {
 ...
 }


Don't use regular expressions for this, there are simpler ways, if you split your dictionary into words instead of doing a mega large string:

  1. A scrambled word is defined by the frequency of letter occurence:

    //WARNING, untested code
    
    alphabet = 'qwertyuiopasdfghjklzxcvbnm';
    function empty_frequences(){
        var freqs = {};
        var i=;
        for(i=0; i<alphabet.length; i++){
            freqs[alphabet[i]] = 0;
        }
        return freqs;
    }
    
    function frequences(str){
        var freqs = empty_frequences();
        var i;
        for(i=0; i<str.length; i++){
            freqs[str[i]] += 1;
        }
    }
    
  2. Use this fact to find all matches in your dictionary

    function matcher(word){
         //returns a function that matchs against this word
         var word_freqs = frequences(word);
         function do_the_match(word2){
             var freqs2 = frequences(word2);
             var i, c;
             for(i=0; i<alphabet.length; i++){
                 c = alphabet[i]
                 if(freqs[c] > freqs2[c]){return false;}
                 //change > to != to allow only strict anagrams
             }
             return true;
         }
         return do_the_match;
     }
    
     function main(word, dict){
         var mf = matcher(word);
         var i, matcheds = [];
         for(i=0; i<dict.length; i++){
             if(mf(dict[i])){ matcheds.push(dict[i]); }
         }
         return matcheds;
     }
    


Here's an idea for you. Constructing the initial lookup data will be slow, but finding a match should be simple. However, you should only build the dictionary once and load it! Recomputing every time is a waste of time.

  1. I am assuming you're only using the Latin alphabet (i.e. what English is written in), everything is case insensitive and you use no numerals...etc. so you have only characters A-Z.

  2. For each word in your dictionary, build a "hash" based on the counts of each letters occurrence. The hash-array will have 26 positions. Each position will the count of the times a specific character for that position was encountered. (e.g. A is in the first array position/index 0; Z in 26th/index 25)
    To cheat a little you can store the results as a a pair of strings. Few, if any, words have 9 repetitions of a single letter, so a single "digit" per letter should work fine. For example: "the" becomes "00001001000000000001000000"; "hat" becomes "10000001000000000001000000"; "that" becomes "10000001000000000002000000".

  3. Load the precomputed dictionary. Use the hashed value as a key in a key-value pair, and have a collection as the value. Append each word with the same key to the end of the collection for that key.

  4. Perform the same hash algorithm on the scrambled word, and look up the key. Output the collection referred to by the key.

EDIT 1: If building a dictionary up front is not viable, then use a variation on this where you create an associative array/dictionary with the letter as the key, and the count of the times it's found as the value. Before computing this, compare lengths, if the strings are of different lengths, then don't bother comparing as you know they mismatch. After computing these arrays for the source (scrambled) and the target (a possible match) compare the keys and values in your associative array.

EDIT 2: Pretty much along the same lines as above, sort the characters within the string for both source and target strings.


Just for the fun of it:

> var words = 'exceptional extraordinary retinas retains retsina antsier nastier retrains starfish';
> words.match(/\b([aeinrst])(?!\1)([aeinrst])(?!\1|\2)([aeinrst])(?!\1|\2|\3)([aeinrst])(?!\1|\2|\3|\4)([aeinrst])(?!\1|\2|\3|\4|\5)([aeinrst])(?!\1|\2|\3|\4|\5|\6)([aeinrst])\b/ig)
[ 'retinas', 'retains', 'retsina', 'antsier', 'nastier' ]

Note that I can't quite figure out how to get the above method to work if you have two of the same letter, e.g. I can't match "boo" :)


If lookup has to be fast, and buildup at the start is no big problem then using a Trie is the most efficient solution I know of. I could explain it, but the WP article is actually very good and gives code samples.

The solution using histograms is probably the best if you're interested primarily in whether 2 given strings match.


I don't know if a regular expression is the best tool for this job. The regex you're building will end up being

"^[iexospensr]{10}$"

which matches any 10-letter word made up of any of the letters in the character class [iexospensr].

Perhaps, if I could find a way to tell the regular expression to consider each letter in the string word just once irrespective of the order of the letters.

You could do that with word.length different regular expressions, but some of your letters repeat. You'd get closer if you sort the letters in the scrambled word, then search for words that have the right number of repetitions of each letter. For example, two e's, two s's, one x, etc.


Regular expressions although powerful aren't the solution for everything.

In some cases like this it's just better to build your own solution: Start by removing all words that don't match the required length, then start comparing letters.

Depending on the length of your dictionary you can build in different optimizations.


I should have seen this Q&A long time ago. I have been doing research on this and I want to share my solution to the problem.

Solution: Step 1: Sort alphabetically the scrambled word (Note: or even the scrambled page of the book for that matter)

Step 2: Build your WORD or PAGE list with an additional column for the sorted word (Note: You can hash this column if you want)

Step 3: Do your matching process. This should find scrambled word from the lookup list.

I was doing some research on finding arbitrary scrambled no. of words in a page and creating a list containing those scrambled words given the scrambled letters.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜