How to find basic, uninflected word for searching?

I am having trouble trying to write a search engine that treats all inflections of a word as the same basic word.

  1. So for verbs these are all the same root word, be:
    • number/person (e.g. am; is; are)
    • tense/mood like past or future tense (e.g. was; were; will be)
    • past participles (e.g. has been; had been)
    • present participles and gerunds (e.g. is being; wasn't being funny; being early is less important than being correct)
    • subjunctives (e.g. might be; critical that something be finished; I wish it were)

  2. Then for nouns, both the singular form and the plural form should count as the same basic word [ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: this is frequently referred to as the citation form of the word.]

For example, with “enable”, I don’t want “enables” and “enabled” printed as separate entries. All three of those should count as the same basic word, the verb enable.

I can prevent printing of duplicates using a hash like:

unless ($seenmatches{ $headmatches[$l] }++)
  1. Could someone explain this? Explained in comments below.

  2. This doesn’t stop the plural/past forms from slipping through. Is there a way to do this, or some wholly different approach, perhaps one involving a regex and/or substitution, then undoing the substitution later?
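For anyone puzzling over the %seenmatches line, here is a minimal, self-contained demonstration of that "seen hash" idiom (variable names borrowed from the snippet above):

```perl
use strict;
use warnings;

# The post-increment inside the hash subscript returns the OLD count:
# 0 (false) the first time a key is seen, true every time after, so the
# unless-guarded code runs exactly once per distinct string.
my @headmatches = qw(enable enables enable enabled enables);
my %seenmatches;
my @unique;
for my $word (@headmatches) {
    push @unique, $word unless $seenmatches{$word}++;
}
print "@unique\n";   # enable enables enabled
```

Note that this deduplicates exact strings only, which is exactly why "enables" and "enabled" still get through: they are different hash keys.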

I can't modify the word with a substitution, because then it wouldn't print out right. Although I'm not at that stage yet, eventually I'd like to include irregular past tenses [ᴇᴅɪᴛᴏʀ's ɴᴏᴛᴇ: and irregular nouns, too?] as well.

I'm not sure what else you need to answer my question, so please just let me know anything I've unintentionally left out, and I'll fill in the missing bits to help make it clearer.


The way a typical search engine works is as follows:

  • The input string is tokenized, chopped up at word boundaries - a character offset start/end is associated with each token
  • Each token is then stemmed - I'd use Lingua::Stem (or, better, Lingua::Stem::Snowball) which are slightly updated versions of the Porter stemmer
  • Each token, with its original character offset start/end is retained and indexed, usually along with a copy of the original text, before it was tokenized. This is basically a table which associates the term text with its originating document (usually as an identifier)
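The tokenize-and-stem steps above can be sketched in plain Perl. The suffix-stripping sub below is only a crude stand-in for a real Porter/Snowball stemmer; in practice you would hand the token list to Lingua::Stem::Snowball instead:

```perl
use strict;
use warnings;

# Tokenize a string at word boundaries, keeping start/end character
# offsets for each token (this is what the index stores so matches can
# be highlighted in the original text later).
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /(\w+)/g) {
        push @tokens, { term => lc $1, start => $-[1], end => $+[1] };
    }
    return @tokens;
}

# Crude stand-in for a real stemmer: strips a few common English
# suffixes. Replace with Lingua::Stem::Snowball for real use.
sub stub_stem {
    my ($word) = @_;
    for my $suffix (qw(ing ed es s)) {
        return substr($word, 0, -length $suffix)
            if length($word) > length($suffix) + 2 && $word =~ /\Q$suffix\E$/;
    }
    return $word;
}

my @tokens = tokenize("Enables and enabled");
$_->{stem} = stub_stem($_->{term}) for @tokens;
# "enables" and "enabled" both index under the same stem, "enabl"
```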

Now, when a query arrives, it too is tokenized and each token stemmed, but we don't care about positions this time. We look up each token against those we have indexed, to locate the postings (matching document identifiers). We can now retrieve the stored start/end offsets to determine where the terms were in the original text.

So, you do lose the suffixes for the index (which is what it used to locate matching documents) but you preserve the original text and offsets for those documents, so you can do query highlighting and nice display stuff should you need to.
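Putting the pieces together, the whole index/query cycle might be sketched like this. The norm() rule is a toy stand-in for real stemming; the key point is that documents and queries pass through the *same* normalization:

```perl
use strict;
use warnings;

# Minimal inverted index: stemmed term => list of [doc_id, start, end]
# postings. Offsets point back into the original, unmodified text.
my %index;

# Toy normalizer (NOT a real stemmer): lowercase and strip a couple of
# common endings so enable/enables/enabled collapse to one key.
sub norm {
    my $w = lc shift;
    $w =~ s/(?:es|ed|e|s)$// if length $w > 4;
    return $w;
}

sub index_doc {
    my ($doc_id, $text) = @_;
    while ($text =~ /(\w+)/g) {
        push @{ $index{ norm($1) } }, [ $doc_id, $-[1], $+[1] ];
    }
}

sub search {
    my ($query) = @_;
    my @postings;
    while ($query =~ /(\w+)/g) {
        push @postings, @{ $index{ norm($1) } || [] };
    }
    return @postings;   # [doc_id, start, end] triples for highlighting
}

index_doc(1, "This enables fast search");
index_doc(2, "Nothing was enabled here");
my @hits = search("enable");
# both documents match; the offsets locate "enables" and "enabled"
```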

Stemming is definitely the right tool for this job. The main trick is to ensure you treat the query and the documents in the same way. You can modify the original document, but really, you want to transform it into something like a back-of-book index, not into a string you run regular expressions over -- if you really are doing search engine stuff, that is. Check out the excellent KinoSearch module on CPAN if you like, or look at the Apache Lucene project it was originally derived from.


The Text::English module includes a Porter stemmer, which is the usual method of treating different forms of the same word as identical for matching purposes.


Check out verbTenseChanger.pl (http://cogcomp.cs.illinois.edu/page/tools_view/1). Here's the readme:

##codes for the various tenses are:
#0 - Base Form
#1 - Past Simple
#2 - Past Participle
#3 - 3rd Person Singular
#4 - Present Participle

##Example use:
##my $newTense = changeVerbForm("see",0,4);
##changes tense from base form to the present participle

I used this (which I guess includes a stemmer) by creating the different forms:

my @changeverbforms = map changeVerbForm( $search_key, 0, $_ ), 1..4;
my @verbforms = grep { defined && $_ ne "" } @changeverbforms;

and then looping through @verbforms (wrapped around the entire search-engine Perl code): everywhere I had $search_key, I also put or $verbform. There were a few extra things to fix, but that's the general implementation (albeit adapted to my specific circumstances).
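As a possible cleanup of that "or $verbform" sprinkling, all the forms can be folded into a single alternation pattern and tested once per line. A sketch (the verb forms here are hard-coded rather than generated by changeVerbForm):

```perl
use strict;
use warnings;

# Hypothetical forms for "enable"; in the real code these would come
# from changeVerbForm() as shown above.
my $search_key = 'enable';
my @verbforms  = qw(enabled enables enabling);

# Fold the key and every form into one case-insensitive alternation,
# so a single regex match replaces a chain of "or $verbform" tests.
my $alternation = join '|', map quotemeta, $search_key, @verbforms;
my $match_any   = qr/\b(?:$alternation)\b/i;

for my $line ("It enables logging", "Logging was enabled", "no match here") {
    print "$line\n" if $line =~ $match_any;
}
```

Only the first two lines print; the \b anchors keep the key from matching inside unrelated longer words.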

For some debugging of the faulty online code, see: https://stackoverflow.com/questions/6459085/need-help-understanding-this-verb-tense-changing-code-please


I tried Lingua::Stem, Lingua::Stem::Snowball, and WordNet::stem, and they all fail to stem many of the most common words. To catch those, you can run this simple stemmer afterwards, which uses WordNet's .exc (exception) files:

1. Download and install WordNet.
2. export WNHOME='/usr/lib/wnres' (if that is the directory containing the dict directory; that's where Cygwin puts it. You'll need that to install WordNet::QueryData.)
3. cat $WNHOME/dict/*.exc > wordnet.exc  (combine all the .exc files)
4. Make this perl file:

$ cat > stem.pl
use strict;
use warnings;

# Read in WordNet exception files
my $ExcFile = "wordnet.exc";
my %Stems;
open(my $FILE, '<', $ExcFile) or die "Could not read $ExcFile: $!";
while (my $line = <$FILE>) {
        chomp($line);
        my ($word, $stem) = split(/\s+/, $line);
        $Stems{$word} = $stem;
}
close($FILE);

while (defined(my $in = <>)) {
        chomp($in); $in =~ s/\r$//;
        $in =~ s/^\s+//;
        $in =~ s/\s+$//;
        next if $in eq '';
        my @words = split(/\s+/, $in);
        foreach my $w (@words) {
                $w = $Stems{$w} if $Stems{$w};
        }
        print "@words\n";
}
<ctrl-D>

Then you can stem foo.txt with

perl stem.pl < foo.txt

You may want to run the other stemmers before rather than after this step, because if they're smart and use word context to stem (though I doubt they do), they'll need the full unstemmed line to work with, whereas stem.pl works word-by-word.
