How to find proper nouns in a string?

I'm trying to identify proper nouns in a user-submitted 3-4 sentence paragraph. I'm OK with the function being flawed somewhat as I have a team of moderators validating just about everything.

An example of an incoming paragraph is below.

Nick Swisher homered off James Shields to key a five-run burst in the first inning and the New York Yankees beat Tampa Bay 8-3 on Tuesday night, opening a 2 1/2-game lead over the Rays in the AL East.

I'd like the function to take the following keywords/proper-nouns out.

Nick Swisher, James Shields, New York Yankees, Tampa Bay, Rays, AL East

I'm thinking I could explode the string and separate the words by spaces. Then I'd check each word to see if the first letter is capitalized. If it is, return it. If not, move on to the next word.

But what about multi-word keywords/proper nouns? How do I get the function to check the word after an already found word with a capitalized first letter?

So the function would find Nick but how do I tell it to check the next word, too? So check if next is capped and if so, return Nick Swisher. If not, just return Nick.

And going one further, what if it's a 3 word phrase? New is found, York is found, how do I get it to find Yankees, too?


Try a regex like this:

[A-Z][a-z]*(\s[A-Z][a-z]*)*

Make sure the match is case-sensitive, i.e. do not apply a case-insensitive flag.
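
As a rough illustration, a pattern of this kind could be applied in PHP with preg_match_all. The sketch below is a variation rather than the exact pattern above: the word boundaries (\b) and the use of [A-Za-z]* instead of [a-z]* are assumptions added here so that an all-caps word such as "AL" is captured whole instead of being split.

```php
<?php
$text = 'Nick Swisher homered off James Shields to key a five-run burst in the first inning and the New York Yankees beat Tampa Bay 8-3 on Tuesday night, opening a 2 1/2-game lead over the Rays in the AL East.';

// A capitalized word, optionally followed by more capitalized words.
// [A-Za-z]* (rather than [a-z]*) lets all-caps words such as "AL" match whole.
$pattern = '/\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\b/';

// No "i" modifier: the match has to stay case-sensitive.
preg_match_all($pattern, $text, $matches);
$properNouns = $matches[0];

print_r($properNouns);
```

On the sample paragraph this also picks up "Tuesday", so the result still needs the moderator review mentioned in the question.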


I don't think you can rely on capitalization. Even if you don't need to work with languages other than English (e.g. German capitalizes all nouns), a considerable percentage of users does not capitalize at all, or not consistently.

I suspect that any attempt to do this based on syntactic rules will fail - your problems with 3 word combinations point towards that. The real problem is that you probably can't find a useful, non-ambiguous syntactic definition of what exactly a "proper noun" is.

A different way to approach it would be to work with a list of known proper nouns (city names, given names, family names) and assume that if you find two or more of them separated only by spaces, it's a compound noun.
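
A minimal sketch of that list-based idea, assuming a tiny hand-written word list (the $known array and the findKnownNouns function are hypothetical names for illustration; a real list would come from databases of city, given, and family names). Because the lookup is done in lowercase, it does not depend on how the user capitalized the text.

```php
<?php
// Hypothetical list of known proper-noun words, stored lowercased as array keys.
$known = array_flip(['nick', 'swisher', 'james', 'shields', 'new', 'york', 'yankees', 'tampa', 'bay', 'rays']);

function findKnownNouns(string $text, array $known): array
{
    $found = [];
    $run = [];
    foreach (preg_split('/\s+/', trim($text)) as $word) {
        // Trailing punctuation means the next word is not separated "only by spaces".
        $clean = rtrim($word, '.,;:!?');
        $isKnown = $clean !== '' && isset($known[strtolower($clean)]);
        if ($isKnown) {
            $run[] = $clean;
        }
        // Close the current run on an unknown word or after punctuation.
        if ((!$isKnown || $clean !== $word) && $run) {
            $found[] = implode(' ', $run);
            $run = [];
        }
    }
    // Flush a run that reaches the end of the text.
    if ($run) {
        $found[] = implode(' ', $run);
    }
    return $found;
}

$nouns = findKnownNouns('nick swisher homered off james shields, and the new york yankees beat tampa bay.', $known);
print_r($nouns);
```

The weakness is visible in the list itself: common words such as "new" or "bay" will also match when they are not part of a name, so the quality of the result depends entirely on the word list.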


I used a service called Open Calais some time ago for a project. It might work for you. You will have to write a simple script to upload your text to the server. Check their API documentation for how to configure it.


You generally can't do something like this, at least not easily.

What if the user forgot to capitalize a proper noun? What about "Thursday"? What about the sentence: "Only I. This person."?

The easiest way is probably to detect capital letters and treat a run of capitalized words as a proper noun. The hardest way involves (linguistic) syntax analysis of English sentences, which is difficult to do.


This will match words starting with uppercase letters, and even multiple succeeding words:

$text = 'Nick Swisher homered off James Shields to key a five-run burst in the first inning and the New York Yankees beat Tampa Bay 8-3 on Tuesday night, opening a 2 1/2-game lead over the Rays in the AL East.';

$matches= array();
preg_match_all('/([[:upper:]]+[[:lower:]]*(\W|$))+/', $text, $matches);
print_r($matches);

Note though that the strings in $matches[0] all end in the characters found in $matches[2]. This can easily be solved by a foreach cleanup statement, or maybe by modifying the regex.
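
One possible form of that cleanup, sketched with array_map instead of a foreach loop (the $cleaned variable is a name introduced here for illustration): every trailing non-word character is stripped from each full match.

```php
<?php
$text = 'Nick Swisher homered off James Shields to key a five-run burst in the first inning and the New York Yankees beat Tampa Bay 8-3 on Tuesday night, opening a 2 1/2-game lead over the Rays in the AL East.';

$matches = array();
preg_match_all('/([[:upper:]]+[[:lower:]]*(\W|$))+/', $text, $matches);

// Each full match ends in a space or punctuation mark; strip those characters.
$cleaned = array_map(
    function ($match) {
        return preg_replace('/\W+$/', '', $match);
    },
    $matches[0]
);

print_r($cleaned);
```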


Here's a script which when run on your paragraph produces an array with the following values:

Array ( [0] => Nick Swisher [1] => James Shields [2] => New York Yankees [3] => Tampa Bay [4] => Tuesday [5] => Rays [6] => AL East. )

Is this helpful?

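$paragraph = 'Nick Swisher homered off James Shields to key a five-run burst in the first inning and the New York Yankees beat Tampa Bay 8-3 on Tuesday night, opening a 2 1/2-game lead over the Rays in the AL East.';

$proper_nouns = array();
$words = explode(' ', $paragraph);
$count = count($words);
for ($i = 0; $i < $count; $i++) {
    // A word containing an uppercase letter starts a candidate proper noun.
    if (preg_match('/[A-Z]/', $words[$i]) > 0) {
        $proper_noun = $words[$i];
        // Absorb the following words for as long as they also contain an uppercase letter.
        // Advancing $i here also prevents the last word of a run from being
        // reported a second time when the run reaches the end of the paragraph.
        while ($i + 1 < $count && preg_match('/[A-Z]/', $words[$i + 1]) > 0) {
            $i++;
            $proper_noun .= ' ' . $words[$i];
        }
        $proper_nouns[] = $proper_noun;
    }
}
print_r($proper_nouns);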


Not sure what language you're working in, but here's a PHP class to find proper nouns. It uses a lot more than just uppercase letters. Even if you aren't using PHP, you can use it as a model for the language you're using. Here's the description:

The proper nouns class can find and extract proper nouns from a given text using heuristics based on syntactic clues such as an uppercased first letter, word position in the sentence, etc. It can try to combine proper nouns using conjunctions to find multi-word proper nouns. The class provides customizations so it can be applied to other languages whose grammar uses the same heuristics.
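
To give an idea of what a "word position in sentence" heuristic might look like, here is a small sketch. It is not the class's actual code or API: findCapitalizedRuns is a hypothetical function that finds runs of capitalized words and flags those that begin a sentence, where capitalization alone says nothing about whether the word is a proper noun.

```php
<?php
// Hypothetical helper for illustration only.
function findCapitalizedRuns(string $text): array
{
    $results = [];
    $pattern = '/\b[A-Z][A-Za-z]*(?:\s+[A-Z][A-Za-z]*)*\b/';
    preg_match_all($pattern, $text, $matches, PREG_OFFSET_CAPTURE);

    foreach ($matches[0] as [$phrase, $offset]) {
        // Text before the match, with trailing whitespace removed.
        $before = rtrim(substr($text, 0, $offset));
        // Sentence-initial if nothing precedes it or the previous character ends a sentence.
        $atStart = $before === '' || in_array(substr($before, -1), ['.', '!', '?'], true);
        $results[] = ['phrase' => $phrase, 'sentence_start' => $atStart];
    }
    return $results;
}

$runs = findCapitalizedRuns('Only I. This person met Nick Swisher.');
print_r($runs);
```

Here "Only I" and "This" are flagged as sentence-initial, while "Nick Swisher" is not, so a moderator or a later rule could treat the flagged entries with more suspicion.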


If you need something more than a regex, the best way to do this is to use a natural language processing library such as Apache OpenNLP, which includes a name finder for named-entity recognition. http://opennlp.apache.org/

OpenNLP is a standalone Java library, so it can be used on its own. If you also want to index and search the text, you can install Apache Solr/Lucene and combine it with OpenNLP. https://lucene.apache.org/solr/

You can download Solr and get it up and running in a few minutes. Then install or build OpenNLP.

This sounds intimidating, but it will give you a LOT of power and a truly scalable solution for things like proper noun extraction and much more.
