开发者

Find all words containing characters in UNIX

Given a word W, I want to find all words containing the letters in W from /usr/dict/words. For example, "bat" should return "bat" and "tab" (but not "table").

Here is one solution which involves sorting the input word and matching:

word=$1
sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`

while read line
do
    sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
    if [ "$sortedWord" == "$sortedLine" ]
    then
        echo $line
    fi
done < /usr开发者_JAVA技巧/dict/words

Is there a better way? I'd prefer using basic commands (instead of perl/awk etc), but all solutions are welcome!

To clarify, I want to find all permutations of the original word. Addition or deletion of characters is not allowed.


here's an awk implementation. It finds the words with those letters in "W".

dict="/usr/share/dict/words"
word=$1
awk -vw="$word" 'BEGIN{
  m=split(w,c,"")
  for(p=1;p<=m;p++){ chars[c[p]]++ }
}
length($0)==length(w){
  f=0;g=0
  n=split($0,t,"")
  for(o=1;o<=n;o++){
    if (!( t[o] in chars) ){
       f=1; break
    }else{ st[t[o]]++ }
  }
  if (!f || $0==w){
      for(z in st){
        if ( st[z] != chars[z] ) { g=1 ;break}
      }
      if(!g){ print "found: "$0 }
  }
  delete st
}' $dict

output

$ wc -l < /usr/share/dict/words
479829

$ time ./shell.sh look
found: kolo
found: look

real    0m1.361s
user    0m1.074s
sys     0m0.015s

Update: change of algorithm, using sorting

dict="/usr/share/dict/words"
awk 'BEGIN{
  w="table"
  m=split(w,c,"")
  b=asort(c,chars)
}
length($0)==length(w){
  f=0
  n=split($0,t,"")
  e=asort(t,d)
  for(i=1;i<=e;i++) {
    if(d[i]!=chars[i]){
        f=1;break
    }
  }
  if(!f) print $0
}' $dict

output

$ time ./shell.sh #looking for table
ablet
batel
belat
blate
bleat
tabel
table

real    0m1.416s
user    0m1.343s
sys     0m0.014s

$ time ./shell.sh #looking for chairs
chairs
ischar
rachis

real    0m1.697s
user    0m1.660s
sys     0m0.014s

$ time perl perl.pl #using beamrider's Perl script
table
tabel
ablet
batel
blate
bleat
belat

real    0m2.680s
user    0m1.633s
sys     0m0.881s

$ time perl perl.pl # looking for chairs
chairs
ischar
rachis

real    0m14.044s
user    0m8.328s
sys     0m5.236s


Here's a shell solution. The best algorithm seems to be #4. It filters out all words that are of incorrect length. Then, it sums the words using a simple substitution cipher (a=1, b=2, A=27, ...). If the sums match, then it will actually do the original sort and compare. On my system, it can churn through ~235k words looking for "bat" in just under 1/2 second. I'm providing all of my solutions so you can see the different approaches.

Update: not shown, but I also tried putting the sum inside the first bin of the histogram approach I tried, but it was even slower than the histograms without. I thought it would function as a short circuit, but it didn't work.

Update2: I tried the awk solution and it runs in about 1/3 the time of my best shell solution or ~0.126s versus ~0.490s. The perl solution runs ~1.1s.

#!/bin/bash

word=$1
#dict=words
dict=/usr/share/dict/words
#dict=/usr/dict/words

alg1() {
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`

  while read line
  do
    sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
    if [ "$sortedWord" == "$sortedLine" ]
    then
      echo $line
    fi
  done < $dict
}

check_sorted_versus_not() {
    local word=$1
    local line=`echo $2 | grep -o . | sort | tr -d '\n'`
    if [ "$word" == "$line" ]
    then
        echo $2
    fi
}

# Filter out all words of incorrect length
alg2() {
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sortedLine=`echo $line | grep -o . | sort | tr -d '\n'`
    if [ "$sortedWord" == "$sortedLine" ]
    then
      echo $line
    fi
  done
}


# Create a lot of variables like this:
# _a=1, _b=2, ... _z=26, _A=27, _B=28, ... _Z=52
gen_chars() {
#  [ -n "$GEN_CHARS" ] && return
  GEN_CHARS=1
  local alpha="abcdefghijklmnopqrstuvwxyz"
  local upperalpha=`echo -n $alpha | tr 'a-z' 'A-Z'`
  local both="$alpha$upperalpha"
  for ((i=0; i < ${#both}; i++))
  do
    ACHAR=${both:i:1}
    eval "_$ACHAR=$((i+1))"
  done
}

# I think it's faster to return the value in a var then to echo it in a sub process.
# Try summing the word one char at a time by building an arithmetic expression
# and then evaluate that expression.
# Requires: gen_chars
sum_word() {
  SUM=0
  local s=""
  # parsing input one character at a time
  for ((i=0; i < ${#1}; i++))
  do
    ACHAR=${1:i:1}
    s="$s\$_$ACHAR+"
  done

  SUM=$(( $(eval echo -n ${s}0) ))
}

# I think it's faster to return the value in a var then to echo it in a sub process.
# Try summing the word one char at a time using a case statement.
sum_word2() {
  SUM=0
  local s=""
  # parsing input one character at a time
  for ((i=0; i < ${#1}; i++))
  do
    ACHAR=${1:i:1}
    case $ACHAR in
    a) SUM=$((SUM+  1));;
    b) SUM=$((SUM+  2));;
    c) SUM=$((SUM+  3));;
    d) SUM=$((SUM+  4));;
    e) SUM=$((SUM+  5));;
    f) SUM=$((SUM+  6));;
    g) SUM=$((SUM+  7));;
    h) SUM=$((SUM+  8));;
    i) SUM=$((SUM+  9));;
    j) SUM=$((SUM+ 10));;
    k) SUM=$((SUM+ 11));;
    l) SUM=$((SUM+ 12));;
    m) SUM=$((SUM+ 13));;
    n) SUM=$((SUM+ 14));;
    o) SUM=$((SUM+ 15));;
    p) SUM=$((SUM+ 16));;
    q) SUM=$((SUM+ 17));;
    r) SUM=$((SUM+ 18));;
    s) SUM=$((SUM+ 19));;
    t) SUM=$((SUM+ 20));;
    u) SUM=$((SUM+ 21));;
    v) SUM=$((SUM+ 22));;
    w) SUM=$((SUM+ 23));;
    x) SUM=$((SUM+ 24));;
    y) SUM=$((SUM+ 25));;
    z) SUM=$((SUM+ 26));;
    A) SUM=$((SUM+ 27));;
    B) SUM=$((SUM+ 28));;
    C) SUM=$((SUM+ 29));;
    D) SUM=$((SUM+ 30));;
    E) SUM=$((SUM+ 31));;
    F) SUM=$((SUM+ 32));;
    G) SUM=$((SUM+ 33));;
    H) SUM=$((SUM+ 34));;
    I) SUM=$((SUM+ 35));;
    J) SUM=$((SUM+ 36));;
    K) SUM=$((SUM+ 37));;
    L) SUM=$((SUM+ 38));;
    M) SUM=$((SUM+ 39));;
    N) SUM=$((SUM+ 40));;
    O) SUM=$((SUM+ 41));;
    P) SUM=$((SUM+ 42));;
    Q) SUM=$((SUM+ 43));;
    R) SUM=$((SUM+ 44));;
    S) SUM=$((SUM+ 45));;
    T) SUM=$((SUM+ 46));;
    U) SUM=$((SUM+ 47));;
    V) SUM=$((SUM+ 48));;
    W) SUM=$((SUM+ 49));;
    X) SUM=$((SUM+ 50));;
    Y) SUM=$((SUM+ 51));;
    Z) SUM=$((SUM+ 52));;
    *) SUM=0; return;;
    esac
  done
}

# I think it's faster to return the value in a var then to echo it in a sub process.
# Try summing the word by building an arithmetic expression using sed and then evaluating
# the expression.
# Requires: gen_chars
sum_word3() {
  SUM=$(( $(eval echo -n `echo -n $1 | sed -E -ne 's,.,$_&+,pg'`) 0))
  #echo "SUM($1)=$SUM"
}

# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
alg3() {
  gen_chars
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
  sum_word $word
  word_sum=$SUM
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word $line
    line_sum=$SUM
    if [ $word_sum == $line_sum ]
    then
      check_sorted_versus_not $sortedWord $line
    fi
  done
}

# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
# Use sum_word2
alg4() {
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
  sum_word2 $word
  word_sum=$SUM
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word2 $line
    line_sum=$SUM
    if [ $word_sum == $line_sum ]
    then
      check_sorted_versus_not $sortedWord $line
    fi
  done
}

# Filter out all words of incorrect length
# Sum the characters in the word: i.e. a=1, b=2, ...  and "abbc" = 1+2+2+3 = 8
# Use sum_word3
alg5() {
  gen_chars
  sortedWord=`echo $word | grep -o . | sort | tr -d '\n'`
  sum_word3 $word
  word_sum=$SUM
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word3 $line
    line_sum=$SUM
    if [ $word_sum == $line_sum ]
    then
      check_sorted_versus_not $sortedWord $line
    fi
  done
}


# I think it's faster to return the value in a var then to echo it in a sub process.
# Try summing the word one char at a time using a case statement.
# Place results in a histogram
sum_word4() {
  SUM=(0 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 
       0 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 0 0 0 0
       0 0 0 0 0 0 
       0)
  # parsing input one character at a time
  for ((i=0; i < ${#1}; i++))
  do
    ACHAR=${1:i:1}
    case $ACHAR in
    a) SUM[1]=$((SUM[ 1] + 1));;
    b) SUM[2]=$((SUM[ 2] + 1));;
    c) SUM[3]=$((SUM[ 3] + 1));;
    d) SUM[4]=$((SUM[ 4] + 1));;
    e) SUM[5]=$((SUM[ 5] + 1));;
    f) SUM[6]=$((SUM[ 6] + 1));;
    g) SUM[7]=$((SUM[ 7] + 1));;
    h) SUM[8]=$((SUM[ 8] + 1));;
    i) SUM[9]=$((SUM[ 9] + 1));;
    j) SUM[10]=$((SUM[10] + 1));;
    k) SUM[11]=$((SUM[11] + 1));;
    l) SUM[12]=$((SUM[12] + 1));;
    m) SUM[13]=$((SUM[13] + 1));;
    n) SUM[14]=$((SUM[14] + 1));;
    o) SUM[15]=$((SUM[15] + 1));;
    p) SUM[16]=$((SUM[16] + 1));;
    q) SUM[17]=$((SUM[17] + 1));;
    r) SUM[18]=$((SUM[18] + 1));;
    s) SUM[19]=$((SUM[19] + 1));;
    t) SUM[20]=$((SUM[20] + 1));;
    u) SUM[21]=$((SUM[21] + 1));;
    v) SUM[22]=$((SUM[22] + 1));;
    w) SUM[23]=$((SUM[23] + 1));;
    x) SUM[24]=$((SUM[24] + 1));;
    y) SUM[25]=$((SUM[25] + 1));;
    z) SUM[26]=$((SUM[26] + 1));;
    A) SUM[27]=$((SUM[27] + 1));;
    B) SUM[28]=$((SUM[28] + 1));;
    C) SUM[29]=$((SUM[29] + 1));;
    D) SUM[30]=$((SUM[30] + 1));;
    E) SUM[31]=$((SUM[31] + 1));;
    F) SUM[32]=$((SUM[32] + 1));;
    G) SUM[33]=$((SUM[33] + 1));;
    H) SUM[34]=$((SUM[34] + 1));;
    I) SUM[35]=$((SUM[35] + 1));;
    J) SUM[36]=$((SUM[36] + 1));;
    K) SUM[37]=$((SUM[37] + 1));;
    L) SUM[38]=$((SUM[38] + 1));;
    M) SUM[39]=$((SUM[39] + 1));;
    N) SUM[40]=$((SUM[40] + 1));;
    O) SUM[41]=$((SUM[41] + 1));;
    P) SUM[42]=$((SUM[42] + 1));;
    Q) SUM[43]=$((SUM[43] + 1));;
    R) SUM[44]=$((SUM[44] + 1));;
    S) SUM[45]=$((SUM[45] + 1));;
    T) SUM[46]=$((SUM[46] + 1));;
    U) SUM[47]=$((SUM[47] + 1));;
    V) SUM[48]=$((SUM[48] + 1));;
    W) SUM[49]=$((SUM[49] + 1));;
    X) SUM[50]=$((SUM[50] + 1));;
    Y) SUM[51]=$((SUM[51] + 1));;
    Z) SUM[52]=$((SUM[52] + 1));;
    *) SUM[53]=-1; return;;
    esac
  done

 #echo ${SUM[*]}
}

# Check if two histograms are equal
hist_are_equal() {
  # Array sizes differ?
  [ ${#_h1[*]} != ${#SUM[*]} ] && return 1

  # parsing input one index at a time
  for ((i=0; i < ${#_h1[*]}; i++))
  do
    [ ${_h1[i]} != ${SUM[i]} ] && return 1
  done

  return 0
}

# Check if two histograms are equal
hist_are_equal2() {
  # Array sizes differ?
  local size=${#_h1[*]}
  [ $size != ${#SUM[*]} ] && return 1

  # parsing input one index at a time
  for ((i=0; i < $size; i++))
  do
    [ ${_h1[i]} != ${SUM[i]} ] && return 1
  done

  return 0
}

# Filter out all words of incorrect length
# Use sum_word4 which generates a histogram of character frequency
alg6() {
  sum_word4 $word
  _h1=${SUM[*]}
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word4 $line
    if hist_are_equal
    then
      echo $line
    fi
  done
}

# Filter out all words of incorrect length
# Use sum_word4 which generates a histogram of character frequency
alg7() {
  sum_word4 $word
  _h1=${SUM[*]}
  grep_string="^`echo -n $word | tr 'a-zA-Z' '.'`\$"

  grep "$grep_string" "$dict" | \
  while read line
  do
    sum_word4 $line
    if hist_are_equal2
    then
      echo $line
    fi
  done
}

run_test() {
  echo alg$1
  eval time alg$1
}

#run_test 1
#run_test 2
#run_test 3
run_test 4
#run_test 5
run_test 6
#run_test 7


#!/usr/bin/perl
$myword=join("", sort split (//, $ARGV[0]));
shift;
while (<>) {
    chomp;
    print "$_\n" if (join("", sort split (//)) eq $myword);
}

Use it like this: bla.pl < /usr/dict/words searchword


You want to find words containing only a given set of characters. A regex for that would be:

'^[letters_you_care_about]*$'

So, you could do:

grep "^[$W]*$" /usr/dict/words

The '^' matches the beginning of the line; '$' is for the end of the line. This means we must have an exact match, not just a partial match (e.g. "table").

'[' and ']' are used to define a group of possible characters allowed in one character space of the input file. We use this to find words in /usr/dict/word that only contain the characters in $W.

The '*' repeats the previous character (the '[...]' rule), which says to find a word of any length, where all the characters are in $W.


So we have the following:

n = length of input word
L = lines in dictionary file

If n tends to be small and L tends to be huge, might we be better off finding all permutations of the input word and looking for those, rather than doing something (like sorting) to all L lines of the dictionary file? (Actually, since finding all permutations of a word is O(n!), and we have to run through the entire dictionary file once for each word, maybe not, but I wrote the code anyway.)

This is Perl - I know you wanted command-line operations but I don't have a way to do that in shell script that's not super-hacky:

sub dedupe {
    my (@list) = @_;
    my (@new_list, %seen_entries, $entry);

    foreach $entry (@list) {
        if (!(defined($seen_entries{$entry}))) {
            push(@new_list, $entry);
            $seen_entries{$entry} = 1;
        }
    }

    return @new_list;
}

sub find_all_permutations {
    my ($word) = @_;
    my (@permutations, $subword, $letter, $rest_of_word, $i);

    if (length($word) == 1) {
        push(@permutations, $word);
    } else {   
        for ($i=0; $i<length($word); $i++) {
            $letter = substr($word, $i, 1);
            $rest_of_word = substr($word, 0, $i) . substr($word, $i + 1);
            foreach $subword (find_all_permutations($rest_of_word)) {
                push(@permutations, $letter . $subword);
            }            
        }
    }

    return @permutations;
}

$words_file = '/usr/share/dict/words';
$word = 'table';

@all_permutations = dedupe(find_all_permutations($word));
foreach $permutation (@all_permutations) {
    if (`grep -c -m 1 ^$permutation\$ $words_file` == 1) {
        print $permutation . "\n";
    }
}


This utility might interest you:

an -w "tab" -m 3

...gives bat and tab only.

The original author seems to not be around any more, but you can find information at http://packages.qa.debian.org/a/an.html (even if you don't want to use it itself, the source might be worth a look).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜