How do I find the best possible answer to a seemingly very large problem?

First off, this is NOT a homework problem. I haven't had to do homework since 1988!

  1. I have a list of words of length N
  2. I have a max of 13 characters to choose from.
  3. There can be multiples of the same letter

Given the list of words, which 13 characters would spell the most possible words? I can throw out words that make the problem harder to solve. For example:

speedometer has 4 e's in it, something MOST words don't have, 
so I could toss that word as a poor fit, or it might just 
fall out on its own as the algorithm runs

I've looked at letter distributions, and I've built a graph of the words (letter by letter). Either there is something I'm missing, or this problem is a lot harder than I thought. I'd rather not brute-force it outright if that's possible, but I'm just about at that point now.

Genetic algorithms come to mind, but I've never tried them before....

Seems like I need a way to score each letter based upon its association with other letters in the words it appears in....


It sounds like a hard combinatorial problem. You are given a dictionary D of words, and you can select N letters (possibly with repeats) to cover / generate as many of the words in D as possible. I'm 99.9% certain it can be shown to be an NP-hard optimization problem in general (assuming an arbitrary alphabet, i.e., a set of letters that may contain more than 26 items) by reduction of SETCOVER to it, but I'm leaving the actual reduction as an exercise to the reader :)

Assuming it's hard, you have the usual routes:

  • branch and bound
  • stochastic search
  • approximation algorithms


Best I can come up with is branch and bound. Make an "intermediate state" data structure that consists of

  • Letters you've already used (with multiplicity)
  • Number of characters you still get to use
  • Letters still available
  • Words still in your list
  • Number of words still in your list (count of the previous set)
  • Number of words that are not possible in this state
  • Number of words that are already covered by your choice of letters

You'd start with

  • Empty set
  • 13
  • {A, B, ..., Z}
  • Your whole list
  • N
  • 0
  • 0

Put that data structure into a queue.
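For concreteness, here's a minimal sketch of that state record in Python (the field names are my own invention, and the "number of words still in your list" bullet is just the size of the live-word set):

    from dataclasses import dataclass

    @dataclass
    class State:
        chosen: tuple          # letters already used, with multiplicity, non-decreasing
        slots_left: int        # characters you still get to use
        allowed: tuple         # letters still available to pick
        live_words: frozenset  # words neither covered nor impossible yet
        impossible: int        # words that can no longer be spelled in this state
        covered: int           # words already spellable from `chosen`
        # "number of words still in your list" is just len(self.live_words)

    ALPHABET = tuple("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

    def initial_state(words):
        # The starting values listed above: empty choice, 13 slots, the full
        # alphabet, the whole word list, and zeroed counters.
        return State((), 13, ALPHABET, frozenset(w.upper() for w in words), 0, 0)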

At each step

  • Pop an item from the queue
  • Split it into possible next states (branch)
  • Bound and delete extraneous possibilities

From a state, I'd generate possible next states as follows:

For each letter L in the set of letters left
    Generate a new state where:
        you've added L to the list of chosen letters
        L becomes the least letter you're allowed to pick from now on,
        so you remove anything less than L from the allowed letters

So, for example, if your left-over set is {W, X, Y, Z}, I'd generate one state with W added to my choice, {W, X, Y, Z} still possible, one with X as my choice, {X, Y, Z} still possible (but not W), one with Y as my choice and {Y, Z} still possible, and one with Z as my choice and {Z} still possible.
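As a sketch, that branching rule might look like this in Python (reusing the State record from the earlier sketch); restricting the children to allowed[i:] is what guarantees each multiset of letters is generated exactly once:

    from dataclasses import replace

    def branch(state):
        # One child per allowed letter; keeping only allowed[i:] enforces
        # non-decreasing picks, so no multiset of letters appears twice.
        for i, letter in enumerate(state.allowed):
            yield replace(
                state,
                chosen=state.chosen + (letter,),
                slots_left=state.slots_left - 1,
                allowed=state.allowed[i:],
            )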

Do all the various accounting to figure out the new states.
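The accounting could look something like this (my own sketch, assuming the State record above): a live word becomes covered once the chosen letters contain its letter counts, and impossible once its missing letters can no longer fit in the remaining slots or have been dropped from the allowed set:

    from collections import Counter
    from dataclasses import replace

    def account(state):
        # Re-classify every live word as covered, impossible, or still live.
        chosen = Counter(state.chosen)
        reachable = set(state.allowed)
        covered, impossible, live = state.covered, state.impossible, set()
        for word in state.live_words:
            missing = Counter(word) - chosen   # letters (with counts) still needed
            if not missing:
                covered += 1                   # already spellable
            elif (sum(missing.values()) > state.slots_left
                  or any(ch not in reachable for ch in missing)):
                impossible += 1                # can never be spelled from here
            else:
                live.add(word)
        return replace(state, live_words=frozenset(live),
                       covered=covered, impossible=impossible)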

Each state covers at minimum its "number of words that are already covered by your choice of letters", and at maximum that number plus its "number of words still in your list." Of all the states, find the highest minimum, and delete any states whose maximum is lower than that.
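As a sketch (my own helper), that bound step reads:

    def prune(states):
        # Lower bound = words already covered; upper bound = covered plus words
        # still live. Drop states whose upper bound can't reach the best lower bound.
        best_lower = max(s.covered for s in states)
        return [s for s in states
                if s.covered + len(s.live_words) >= best_lower]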

No special handling for speedometer required.

I can't imagine this would be fast, but it'd work.
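For what it's worth, here is how the sketches above might fit together into a driver loop (again just a sketch, not tuned for speed):

    from collections import deque

    def search(words):
        frontier = deque([initial_state(words)])
        best = frontier[0]
        while frontier:
            state = frontier.popleft()
            if state.covered > best.covered:
                best = state              # best guaranteed coverage so far
            if state.slots_left == 0:
                continue
            children = prune([account(c) for c in branch(state)])
            # also bound against the best coverage found anywhere so far
            frontier.extend(c for c in children
                            if c.covered + len(c.live_words) > best.covered)
        return best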

There are probably some optimizations (e.g., store each word in your list as an array of A-Z occurrence counts, and combine words with the same structure: BAT and TAB share one count vector with single occurrences of A, B, and T, so they can be tracked as a single entry with weight 2). How you sort and keep track of the minimums and maximums can probably also help somewhat. Probably not enough to make an asymptotic difference, but for a problem this big, maybe enough to make it run in a reasonable time instead of an extreme one.
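That anagram-combining optimization might look like this (sketch):

    from collections import Counter

    def group_words(words):
        # Words with identical letter counts (BAT, TAB) collapse into one
        # entry; the value is how many words that letter-multiset represents.
        groups = Counter()
        for word in words:
            groups[tuple(sorted(word.upper()))] += 1
        return groups

The covered/impossible accounting would then add each group's weight instead of 1 per word.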


Total brute force should work, although the implementation could get quite confusing.

Instead of throwing words like speedometer out, couldn't you build the association graphs considering only whether a character appears in a word or not (irrespective of the number of times it appears, since that shouldn't have any bearing on the final best choice of 13 characters)? That would also make it fractionally simpler than total brute force.
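As an illustration of that presence-only view (my own sketch, assuming plain A-Z words), each word reduces to a 26-bit mask, and under this relaxation a pool covers a word exactly when the word's mask is a subset of the pool's:

    def letter_mask(word):
        # 26-bit presence mask; assumes the word contains only A-Z letters.
        mask = 0
        for ch in word.upper():
            mask |= 1 << (ord(ch) - ord("A"))
        return mask

    def covers(pool, word):
        # Presence-only relaxation: every letter of the word appears
        # somewhere in the pool, multiplicity ignored.
        return (letter_mask(word) & ~letter_mask(pool)) == 0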

Comments welcome. :)


Removing the bounds on each parameter including alphabet size, there's an easy objective-preserving reduction from the maximum coverage problem, which is NP-hard and hard to approximate with a ratio better than (e - 1) / e ≈ 0.632. It's fixed-parameter tractable in the alphabet size by brute force.

I agree with Nick Johnson's suggestion of brute force; at worst, there are only (13 + 26 - 1) choose (26 - 1) multisets, which is about 5 billion. If you limit the multiplicity of each letter to what could ever be useful, this number gets a lot smaller. Even if it's too slow, you should be able to recycle the data structures.
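A sketch of that brute force in Python (the function name and structure are mine; at full scale this is far too slow in pure Python, but it shows the shape of the loop):

    from collections import Counter
    from itertools import combinations_with_replacement
    from string import ascii_uppercase

    def best_pool(words, size=13):
        # Enumerate every multiset of `size` letters -- the C(13+26-1, 26-1)
        # count quoted above -- and score each by how many words it can spell.
        needs = [Counter(w.upper()) for w in words]
        best, best_score = None, -1
        for pool in combinations_with_replacement(ascii_uppercase, size):
            have = Counter(pool)
            # A word is spellable iff its letter counts fit inside the pool's.
            score = sum(1 for need in needs if not (need - have))
            if score > best_score:
                best, best_score = pool, score
        return best, best_score

Capping each letter's multiplicity at its maximum count in any dictionary word, as suggested, shrinks the enumeration considerably.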


I did not completely understand this: "I have a max of 13 characters to choose from." If you have a list of 1000 words, did you mean you have to reduce that to just 13 chars?!

Some thoughts based on my (mis)understanding:

If you are only handling English-language words, then you can skip vowels, because consonants are just as descriptive. Our brains can sort of fill in the vowels - a.k.a. SMS/Twitter language :)

Perhaps for 1-3 letter words, stripping off vowels would lose too much info. But still:

spdmtr hs 4 's n t, smthng MST wrds dn't hv, s cld tss tht wrd d t pr ft chrctrstc, r t mght jst g wy bsd n th lgrthm

Stemming will cut words even shorter. Stemming first, then strip vowels. Then do a histogram....
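A toy sketch of the strip-vowels-then-histogram part (stemming left out; a real pass might run something like NLTK's PorterStemmer first):

    import re
    from collections import Counter

    def strip_vowels(word):
        return re.sub(r"[aeiou]", "", word.lower())

    def consonant_histogram(words):
        # Histogram of the consonants left after vowel-stripping.
        return Counter(ch for w in words for ch in strip_vowels(w))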
