
Generate sensible strings using a pattern

I have a table of strings (about 100,000) in following format:

pattern , string

e.g. -

*l*ph*nt , elephant
c*mp*t*r , computer
s*v* , save
s*nn] , sunny
]*rr] , worry

To simplify, assume a * denotes a vowel, a consonant stands unchanged, and ] denotes either a 'y' or a 'w' (say, for instance, semi-vowels/round-vowels in phonology).

Given a pattern, what is the best way to generate the possible sensible strings? A sensible string is one in which every consecutive two-letter substring that was not specified in the pattern occurs somewhere in the data-set.

e.g. -

h*ll* --> hallo, hello, holla ...

'hallo' is sensible because 'ha', 'al' and 'lo' can be seen in the data-set, as in the words 'have', 'also' and 'low'. The digraph 'll' is not considered because it was specified in the pattern.
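The check described above can be sketched directly; `DATASET_WORDS` is a three-word stand-in for the real 100,000-word table, and the names are illustrative:

```python
DATASET_WORDS = ['have', 'also', 'low']

def digraphs(word):
    # All consecutive two-letter substrings of a word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

# Collect every digraph seen anywhere in the data-set.
known = set()
for w in DATASET_WORDS:
    known |= digraphs(w)

# 'hallo' matched against pattern 'h*ll*': 'll' comes from the pattern,
# so only the remaining digraphs must appear in the data-set.
candidate = digraphs('hallo') - {'ll'}
is_sensible = candidate.issubset(known)   # True for this toy data-set
```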

What are the simple and efficient ways to do this? Are there any libraries/frameworks for achieving this?

I have no specific language in mind, but I'd prefer to use Java for this program.


This is particularly well suited to Python itertools, set and re operations:

import re
import itertools

VOWELS      = 'aeiou'
SEMI_VOWELS = 'wy'
DATASET     = '/usr/share/dict/words'
SENSIBLES   = set()

def digraphs(word, digraph=r'..'):
    '''
    >>> sorted(digraphs('bar'))
    ['ar', 'ba']
    '''
    base = re.findall(digraph, word)
    base.extend(re.findall(digraph, word[1:]))
    return set(base)

def expand(pattern, wildcard, elements):
    '''
    >>> expand('h?', '?', 'aeiou')
    ['ha', 'he', 'hi', 'ho', 'hu']
    '''
    tokens = re.split(re.escape(wildcard), pattern)
    results = set()
    # use product, not permutations, so the same letter may fill
    # several slots (e.g. the two e's in 'elephant')
    for combo in itertools.product(elements, repeat=len(tokens) - 1):
        results.add(''.join(t + c for t, c in zip(tokens, combo)) + tokens[-1])
    return sorted(results)

def enum(pattern):
    not_sensible = digraphs(pattern, r'[^*\]]{2}')
    for p in expand(pattern, '*', VOWELS):
        for q in expand(p, ']', SEMI_VOWELS):
            if (digraphs(q) - not_sensible).issubset(SENSIBLES):
                print(q)

## Init the data-set (may be long...)
## you may want to pre-compute this
## and adapt it to your data-set.
with open(DATASET, 'r') as f:
    for word in f:
        SENSIBLES.update(digraphs(word.rstrip()))

enum('*l*ph*nt')
enum('s*nn]')
enum('h*ll*')
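As the init comment hints, the digraph set is worth pre-computing. A minimal sketch of caching it with pickle (the cache filename and helper names are assumptions, not part of the answer above):

```python
import os
import pickle

def digraphs(word):
    # Consecutive two-letter substrings, as in the answer above.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def load_sensibles(dataset, cache='sensibles.pickle'):
    """Build the digraph set once and reuse it from a pickle cache."""
    if os.path.exists(cache):
        with open(cache, 'rb') as f:
            return pickle.load(f)
    sensibles = set()
    with open(dataset) as f:
        for word in f:
            sensibles.update(digraphs(word.rstrip()))
    with open(cache, 'wb') as f:
        pickle.dump(sensibles, f)
    return sensibles
```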


As there aren't many possibilities for two-letter substrings, you can go through your dataset and build a table containing the count for every two-letter substring, so the table will look something like this:

ee    1024 times
su    567 times
...
xy    45 times
xz    0 times

The table will be small as you'll only have about 26*26 = 676 values to store.

You have to do this only once for your dataset (or update the table whenever the dataset changes, if it is dynamic) and can then use the table to evaluate candidate strings. For example, add the values for 'ha', 'al' and 'lo' to get a "score" for the string 'hallo'. After that, choose the string(s) with the highest score(s).

Note that the scoring can be improved by checking longer substrings, for example three letters, but this will also result in larger tables.
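A minimal sketch of the counting/scoring idea in Python; the counts here are invented for illustration, and in a real run the table would be built from the data-set:

```python
from collections import Counter

# Hypothetical digraph counts (a real table would come from the data-set):
COUNTS = Counter({'ha': 120, 'al': 80, 'lo': 60,
                  'he': 200, 'el': 70, 'ho': 50, 'ol': 15, 'la': 40})

def score(candidate, fixed=frozenset()):
    # Sum the counts of every consecutive two-letter substring,
    # skipping digraphs that were already fixed by the pattern.
    return sum(COUNTS[candidate[i:i + 2]]
               for i in range(len(candidate) - 1)
               if candidate[i:i + 2] not in fixed)

# For the pattern 'h*ll*', the digraph 'll' comes from the pattern itself:
candidates = ['hallo', 'hello', 'holla']
ranked = sorted(candidates, key=lambda c: score(c, fixed={'ll'}), reverse=True)
```

Unknown digraphs score 0 automatically, since `Counter` returns 0 for missing keys.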
