开发者

Word boundary detection from text

I am having this problem with word boundary identification. I removed all the markup of the wikipedia document, now I want to get a list of entities.(meaningful terms). I am planning to take bi-grams, tri-grams of the document and check if it exists in dictionary(wordnet). Is there a better way to achieve this.

Below is the sample text. I want to identify entities(shown as surrounded by double quotes)

Vulcans are a humanoid species in the fictional "Star Trek" universe who evolve开发者_如何学编程d on the planet Vulcan and are noted for their attempt to live by reason and logic with no interference from emotion They were the first extraterrestrial species officially to make first contact with Humans and later became one of the founding members of the "United Federation of Planets"


I think what you're talking about is really still a subject of burgeoning research rather than a simple matter of applying well-established algorithms.

I can't give you a simple "do this" answer, but here are some pointers off the top of my head:

  • I think using WordNet could work (not sure where bigrams/trigrams come into it though), but you should view WordNet lookups as part of a hybrid system, not the be-all and end-all to spotting named entities
  • then, start by applying some simple, common-sense criteria (sequences of capitalised words; try and accommodate frequent lower-case function words like 'of' into these; sequences consisting of "known title" plus capitalisd word(s));
  • look for sequences of words that statistically you wouldn't expect to appear next to one another by chance as candidates for entities;
  • can you build in dynamic web lookup? (your system spots the capitalised sequence "IBM" and sees if it finds e.g. a wikipedia entry with the text pattern "IBM is ... [organisation|company|...]".
  • see if anything here and in the "information extraction" literature in general gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html

The truth is that when you look at what literature there is out there, it doesn't seem like people are using terribly sophisticated, well-established algorithms. So I think there's a lot of room for looking at your data, exploration and seeing what you can come up with... Good luck!


If I understand correctly, you're looking to extract substrings delimited by double-quotation marks ("). You could use capture-groups in regular expressions:

    String text = "Vulcans are a humanoid species in the fictional \"Star Trek\"" +
        " universe who evolved on the planet Vulcan and are noted for their " +
        "attempt to live by reason and logic with no interference from emotion" +
        " They were the first extraterrestrial species officially to make first" +
        " contact with Humans and later became one of the founding members of the" +
        " \"United Federation of Planets\"";
    String[] entities = new String[10];                 // An array to hold matched substrings
    Pattern pattern = Pattern.compile("[\"](.*?)[\"]"); // The regex pattern to use
    Matcher matcher = pattern.matcher(text);            // The matcher - our text - to run the regex on
    int startFrom   = text.indexOf('"');                // The index position of the first " character
    int endAt       = text.lastIndexOf('"');            // The index position of the last " character
    int count       = 0;                                // An index for the array of matches
    while (startFrom <= endAt) {                        // startFrom will be changed to the index position of the end of the last match
        matcher.find(startFrom);                        // Run the regex find() method, starting at the first " character
        entities[count++] = matcher.group(1);           // Add the match to the array, without its " marks
        startFrom = matcher.end();                      // Update the startFrom index position to the end of the matched region
    }

OR write a "parser" with String functions:

    int startFrom = text.indexOf('"');                              // The index-position of the first " character
    int nextQuote = text.indexOf('"', startFrom+1);                 // The index-position of the next " character
    int count = 0;                                                  // An index for the array of matches
    while (startFrom > -1) {                                        // Keep looping as long as there is another " character (if there isn't, or if it's index is negative, the value of startFrom will be less-than-or-equal-to -1)
        entities[count++] = text.substring(startFrom+1, nextQuote); // Retrieve the substring and add it to the array
        startFrom = text.indexOf('"', nextQuote+1);                 // Find the next " character after nextQuote
        nextQuote = text.indexOf('"', startFrom+1);                 // Find the next " character after that
    }

In both, the sample-text is hard-coded for the sake of the example and the same variable is presumed to be present (the String variable named text).

If you want to test the contents of the entities array:

    int i = 0;
    while (i < count) {
        System.out.println(entities[i]);
        i++;
    }

I have to warn you, there may be issues with border/boundary cases (i.e. when a " character is at the beginning or end of a string. These examples will not work as expected if the parity of " characters is uneven (i.e. if there is an odd number of " characters in the text). You could use a simple parity-check before-hand:

    static int countQuoteChars(String text) {
        int nextQuote = text.indexOf('"');              // Find the first " character
        int count = 0;                                  // A counter for " characters found
        while (nextQuote != -1) {                       // While there is another " character ahead
            count++;                                    // Increase the count by 1
            nextQuote = text.indexOf('"', nextQuote+1); // Find the next " character
        }
        return count;                                   // Return the result
    }

    static boolean quoteCharacterParity(int numQuotes) {
        if (numQuotes % 2 == 0) { // If the number of " characters modulo 2 is 0
            return true;          // Return true for even
        }
        return false;             // Otherwise return false
    }

Note that if numQuotes happens to be 0 this method still returns true (because 0 modulo any number is 0, so (count % 2 == 0) will be true) though you wouldn't want to go ahead with the parsing if there are no " characters, so you'd want to check for this condition somewhere.

Hope this helps!


Someone else asked a similar question about how to find "interesting" words in a corpus of text. You should read the answers. In particular, Bolo's answer points to an interesting article which uses the density of appearance of a word to decide how important it is---using the observation that when a text talks about something, it usually refers to that something fairly often. This article is interesting because the technique does not require prior knowledge on the text that is being processed (for instance, you don't need a dictionary targeted at the specific lexicon).

The article suggests two algorithms.

The first algorithm rates single words (such as "Federation", or "Trek", etc.) according to their measured importance. It is straightforward to implement, and I could even provide a (not very elegant) implementation in Python.

The second algorithm is more interesting as it extracts noun phrases (such as "Star Trek", etc.) by completely ignoring whitespace and using a tree-structure to decide how to split noun phrases. The results given by this algorithm when applied to Darwin's seminal text on evolution are very impressive. However, I admit implementing this algorithm would take a bit more thought as the description given by the article is fairly elusive, and what more the authors seem a bit difficult to track down. That said, I did not spend much time, so you may have better luck.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜